How to Achieve SOTA Distributed LLM Inference Performance on Kubernetes? llm-d v0.5 Measured at up to 50k tok/s

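Q: How does llm-d achieve high-performance inference?

llm-d integrates vLLM, the Kubernetes Gateway API, and techniques such as disaggregated serving, prefix-cache aware routing, and tiered KV caching to achieve SOTA inference performance across a variety of accelerators.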

Introduction

llm-d is a high-performance distributed inference serving stack optimized for production deployments on Kubernetes. We help you achieve the fastest "time to state-of-the-art (SOTA) performance" for key OSS large language models across most hardware accelerators and infrastructure providers with well-tested guides and real-world benchmarks.

What Does llm-d Offer to Production Inference?

Model servers like vLLM and SGLang handle efficiently running large language models on accelerators. llm-d provides state-of-the-art orchestration above model servers to serve high-scale real-world traffic efficiently and reliably:

Key Features

Optimized Baseline - Deploy vLLM behind a Gateway API-based load balancer enhanced with an inference scheduler to decrease serving latency and increase throughput with prefix-cache aware routing, utilization-based load balancing, fairness and prioritization for multi-tenant serving, and predicted latency balancing (experimental).
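
As an illustration of the idea behind prefix-cache aware routing, the sketch below hashes a prompt in fixed-size token blocks, chaining each hash to the previous one, and picks the pod whose cache already holds the longest leading run of blocks. It is a toy model and not llm-d's implementation: the block size, the `pod_caches` mapping, and all function names are assumptions made for this example.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; hypothetical value chosen for the example


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block of tokens, chaining in the previous block's hash."""
    hashes = []
    prev = b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest
    return hashes


def cached_prefix_blocks(request_hashes, pod_cache):
    """Count the leading blocks of the request that a pod already caches."""
    count = 0
    for h in request_hashes:
        if h not in pod_cache:
            break
        count += 1
    return count


def pick_pod(tokens, pod_caches):
    """Return the pod name with the longest cached prefix (ties: first by name).

    pod_caches is a hypothetical mapping of pod name -> set of block hashes.
    """
    hashes = block_hashes(tokens)
    return max(
        sorted(pod_caches),
        key=lambda name: cached_prefix_blocks(hashes, pod_caches[name]),
    )
```

Because each hash depends on every block before it, two prompts share a block hash only if they share the entire prefix up to that block, which is why counting leading matches is enough.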

Disaggregated Serving (Prefill/Decode Disaggregation) - Reduce time to first token (TTFT) and get more predictable time per output token (TPOT) by splitting inference into prefill servers handling prompts and decode servers handling responses, primarily on large models such as gpt-oss-120b and when processing very long prompts.
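
To make the two latency metrics concrete, here is a minimal sketch that derives TTFT and TPOT from timestamps. It uses one common definition of TPOT (the mean gap between output tokens after the first one); other tools may define it slightly differently, and the function name is hypothetical.

```python
def ttft_and_tpot(request_time, token_times):
    """Return (TTFT, TPOT) in seconds.

    request_time: when the request was sent.
    token_times: arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_time
    if len(token_times) == 1:
        return ttft, 0.0
    # TPOT: average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```

In this framing, prefill work mostly determines TTFT while decode work determines TPOT, which is why separating the two phases lets each be tuned independently.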

Wide Expert-Parallelism - Deploy very large Mixture-of-Experts (MoE) models like DeepSeek-R1 for much higher throughput for RL and latency-insensitive workloads, using Data Parallelism and Expert Parallelism over fast accelerator networks.

Tiered KV Prefix Caching with CPU and Storage Offload - Improve prefix cache hit rate by offloading KV-cache entries to CPU memory, local SSD, and remote high-performance filesystem storage.
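
The effect of tiering on hit rate can be sketched with a toy cache in which an entry evicted from a faster tier is demoted to the next tier instead of being discarded. This is only an illustration of the general idea, not vLLM's or llm-d's offloading code; the class, tier names, and capacities are all hypothetical.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy tiered cache: evictions from a faster tier are demoted to the next."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, key, value, level=0):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: the entry is dropped
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)
        while len(store) > cap:
            # Evict the least recently used entry and demote it one tier down.
            old_key, old_value = store.popitem(last=False)
            self.put(old_key, old_value, level + 1)

    def get(self, key):
        """Return (tier_name, value) and promote the entry, or None on a miss."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self.put(key, value, 0)  # promote to the fastest tier
                return name, value
        return None
```

With a single tier, the second lookup for an evicted prefix would be a miss and force recomputation; with extra tiers it becomes a slower hit instead.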

Workload Autoscaling - Autoscale multi-model workloads on heterogeneous shared hardware with SLO-aware cost optimization using the Workload Variant Autoscaler or autoscale workloads on homogeneous hardware where each model scales independently using HPA with EPP metrics.
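
For the HPA path, the scaling decision follows the standard Kubernetes Horizontal Pod Autoscaler rule: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance of 1. The sketch below reproduces that rule in isolation; which EPP metric is used and what its target should be are deployment choices not specified here, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Apply the HPA rule: ceil(current_replicas * current_metric / target_metric).

    No scaling happens while the ratio stays within `tolerance` of 1.0.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

For example, four replicas observing a per-replica metric of 30 against a target of 20 would be scaled to six.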

These guides provide tested and benchmarked recipes and Helm charts to start serving quickly with best practices common to production deployments. They are extensible and customizable for particulars of your models and use cases, using standard open source components like Kubernetes, Kubernetes Gateway API, NIXL, and vLLM. Our intent is to eliminate the heavy lifting common in tuning and deploying generative AI inference on modern accelerators.

Get Started Now

We recommend that new users start by deploying the optimized baseline.

Note: We are currently revamping our documentation. You can also preview our new quickstarts, which will be formally released soon.

Latest News 🔥

[2026-02] v0.5 - Introduces reproducible benchmark workflows, hierarchical KV offloading, cache-aware LoRA routing, active-active HA, UCCL-based transport resilience, and scale-to-zero autoscaling; validated ~3.1k tok/s per B200 decode GPU (wide-EP) and up to 50k output tok/s on a 16×16 B200 prefill/decode topology with order-of-magnitude TTFT reduction vs round-robin baseline.

[2025-12] v0.4 - Demonstrates 40% reduction in per output token latency for DeepSeek V3.1 on H200 GPUs, Intel XPU and Google TPU disaggregation support for lower time to first token, a new well-lit path for prefix cache offload to vLLM-native CPU memory tiering, and a preview of the workload variant autoscaler improving model-as-a-service efficiency.

🧱 Architecture

llm-d accelerates distributed inference by integrating industry-standard open technologies: vLLM as default model server and engine, Kubernetes Inference Gateway as control plane API and load balancing orchestrator, and Kubernetes as infrastructure orchestrator and workload control plane.

llm-d Adds:

Model Server Optimizations in vLLM - The llm-d team contributes and maintains high performance distributed serving optimizations in upstream vLLM, including disaggregated serving, KV connector interfaces, support for frontier OSS mixture of experts models, and production-ready observability and resiliency.

Inference Scheduler - llm-d uses compatible Gateway implementations and their extensible balancing policies to make customizable "smart" load-balancing decisions specifically for LLMs without reimplementing a full-featured load balancer. Leveraging operational telemetry, the Inference Scheduler implements the filtering and scoring algorithms to make decisions with P/D-awareness, KV-cache-awareness, SLA-awareness, and load-awareness.
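
The filter-then-score pattern described above can be sketched as follows. This is a simplified illustration rather than the Inference Scheduler's actual code: the `Pod` fields, the weights, and the queue limit are assumptions chosen for the example, and real deployments would draw these signals from operational telemetry.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    role: str          # "prefill" or "decode" (P/D-awareness)
    queue_depth: int   # waiting requests (load-awareness)
    prefix_hit: float  # fraction of the prompt already cached, 0.0 to 1.0


def filter_pods(pods, role, max_queue):
    """Filtering step: drop pods with the wrong role or too much queued work."""
    return [p for p in pods if p.role == role and p.queue_depth <= max_queue]


def score_pod(pod, max_queue, w_cache=0.7, w_load=0.3):
    """Scoring step: weighted sum of cache affinity and spare capacity.

    The weights are hypothetical, not defaults from llm-d.
    """
    spare = 1.0 - pod.queue_depth / max_queue
    return w_cache * pod.prefix_hit + w_load * spare


def schedule(pods, role, max_queue=10):
    """Return the best-scoring eligible pod, or None if none pass the filters."""
    candidates = filter_pods(pods, role, max_queue)
    if not candidates:
        return None
    return max(candidates, key=lambda p: score_pod(p, max_queue))
```

Separating filters (hard constraints) from scorers (soft preferences) keeps each policy small and lets new signals be added without rewriting the selection logic.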

Disaggregated Serving Sidecar - llm-d orchestrates prefill and decode phases onto independent instances - the scheduler decides which instances should receive a given request, and the transaction is coordinated via a sidecar alongside decode instances. The sidecar instructs vLLM to provide point to point KV cache transfer over fast interconnects (IB/RoCE RDMA, TPU ICI, and DCN) via NIXL.

vLLM Native CPU Offloading and llm-d Filesystem Backend - llm-d uses vLLM's KVConnector abstraction to configure a pluggable KV cache hierarchy, including offloading KVs to host, remote storage, and systems like LMCache, Mooncake, and KVBM.

Variant Autoscaling over Hardware, Workload, and Traffic - A traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derives a load function that takes into account different request shapes and QoS, and (c) assesses recent traffic mix (QPS, QoS, and shapes) to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency.
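
Steps (a) through (c) can be sketched as a small sizing calculation: a load function converts each request shape into cost units, the recent traffic mix is summed into an offered load, and that load is divided by measured per-instance capacity. The cost coefficients, the headroom factor, and the function names below are illustrative assumptions, not values or APIs from the Workload Variant Autoscaler.

```python
import math


def request_load(input_tokens, output_tokens, prefill_cost=1.0, decode_cost=4.0):
    """Hypothetical load function: cost units for one request of a given shape."""
    return prefill_cost * input_tokens + decode_cost * output_tokens


def instances_needed(traffic_mix, capacity_per_instance, headroom=0.8, min_instances=0):
    """Size a deployment for a traffic mix.

    traffic_mix: list of (qps, input_tokens, output_tokens) tuples.
    capacity_per_instance: measured cost units per second one instance sustains.
    headroom: fraction of measured capacity to plan against.
    """
    offered = sum(qps * request_load(i, o) for qps, i, o in traffic_mix)
    needed = math.ceil(offered / (capacity_per_instance * headroom))
    return max(min_instances, needed)
```

For instance, a mix of 2 QPS of long-prompt requests and 1 QPS of long-output requests can call for more instances than the raw QPS suggests, because the load function weights the two shapes differently.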

What Is in Scope for llm-d

llm-d currently targets improving the production serving experience around:

Online serving and online batch of Generative models running in PyTorch or JAX
- Large language models (LLMs) with 1 billion or more parameters
- Using most or all of the capacity of one or more hardware accelerators
- Running in throughput, latency, or multiple-objective configurations

On recent generation datacenter-class accelerators - NVIDIA A100+, AMD MI250, Google TPU v5e or newer, and Intel GPU Max series or newer
On Kubernetes 1.29+, integrated via code into Ray, or as a standalone service

🔍 Observability

Monitoring & Metrics - Prometheus, Grafana dashboards, and PromQL queries
Distributed Tracing - OpenTelemetry tracing across vLLM, routing proxy, and EPP

📦 Releases

Our guides are living docs and kept current. For details about the Helm charts and component releases, visit our GitHub Releases page to review release notes.

Contribute

See our project overview for more details on our development process and governance.
Review our contributing guidelines for detailed information on how to contribute to the project.
Join one of our Special Interest Groups (SIGs) to contribute to specific areas of the project and collaborate with domain experts.
We use Slack to discuss development across organizations. Please join: Slack
We host a bi-weekly standup for contributors every other Wednesday at 12:30 PM ET, as well as meetings for various SIGs. You can find them in the shared llm-d calendar.
We use Google Groups to share architecture diagrams and other content. Please join: Google Group

License

This project is licensed under Apache License 2.0. See the LICENSE file for details.

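Frequently Asked Questions (FAQ)

How does llm-d achieve high-performance inference?

llm-d integrates vLLM, the Kubernetes Gateway API, and techniques such as disaggregated serving, prefix-cache aware routing, and tiered KV caching to achieve SOTA inference performance across a variety of accelerators.

Which advanced inference features does llm-d support?

llm-d supports disaggregated serving (prefill/decode disaggregation), wide expert-parallelism, tiered KV prefix cache offloading, and workload autoscaling to optimize latency and throughput.

What is new in llm-d v0.5?

v0.5 introduces reproducible benchmark workflows, hierarchical KV offloading, cache-aware LoRA routing, active-active HA, UCCL-based transport resilience, and scale-to-zero autoscaling, and reaches up to 50k output tok/s on a 16×16 B200 prefill/decode topology.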
