State of the Model Serving Communities - November 2025
Most recent updates from several AI/ML model inference communities that our teams at Red Hat AI are contributing to.
Hi everyone,
This newsletter will provide you with recent updates on various model serving communities, keep you informed about Red Hat AI’s contributions to upstream communities, and foster collaborations across teams and organizations.
In case you missed it, we launched this newsletter publicly on Substack last month, and we already have 800+ subscribers! Feel free to share it with others who might be interested in receiving future updates.
Contributors: Pete Cheslock, Sasa Zelenovic, Wentao Ye, Nick Hill, Nir Rozenbaum, Jooho Lee, Yuan Tang
Executive Summary
Community Outreach: Red Hat will be at KubeCon + CloudNativeCon North America in Atlanta the week of November 10th at booth #100. The vLLM community has seen significant activity, with LLM Compressor surpassing 30,000 weekly installs, successful office hours, and numerous meetups globally, including upcoming events in Zurich, Paris, Bangkok, Hyderabad, Malaysia, and San Diego. The llm-d community has presented at AMD Dev Day and PyTorch Conf, is seeking feedback for its v0.4 release, and has upcoming talks at Cloud Native + Kubernetes AI Day and KubeCon.
WG Serving: The WG co-chairs will deliver a KubeCon session titled “Navigating the Rapid Evolution of Large Model Inference: Where does Kubernetes Fit?”. GIE released v1.1.0 with experimental features such as Flow Control, multi-port support, and multi-cluster support. Serving Catalog added LMCache and GCSFuse components and is adding support for AKS & EKS templates. The Inference Perf team is working on release v1.0.0, which will include synthetic multi-turn chat support, dataset trace replay support, and CI/CD and testing improvements.
KServe: KServe will be featured in a keynote session at KubeCon Atlanta and a dedicated session on Cloud Native AI Day. A CNCF announcement blog about KServe joining CNCF as an incubating project is scheduled to be posted during KubeCon. Notable community PRs include the addition of a new time-series forecast API, progressive rollout for raw deployments, and a refactoring of the LLMISvc manifests to enable modular installs and improve maintainability and clarity.
llm-d: Updates from various SIGs from the llm-d community:
Inference Scheduler: Moved the sidecar code to the inference-scheduler repo, and has proposals open for comments on EPP as a Standalone Request Scheduler, an extension to the pluggable framework, and serving online batch via the inference gateway.
Benchmarking: Integrated with WVA, offers a Configuration Explorer for evaluation, replaced fmperf support with InferenceMAX support, and integrated the capacity planner into standup.sh.
PD Disaggregation: Updates on Wide EP (fixed CUDA_VISIBLE_DEVICES issue, enabled PD transfers, added a basic test, in-progress All2All kernels and GB200 NVL72), Elastic EP milestone 2 PR, and llm-d 0.4 Pareto chart automation.
KV Disaggregation: Plans to add LMCache and native CPU offloading, proposals for KV cache storage and CPU medium support, an RFC on llm-d native storage connector, and updates on Valkey, RDMA, gRPC Service, and UDS-based external tokenizer service.
Installation: Modularized reusable recipes, added umbrella KV cache offloading folder structure, verified AKS well-lit path, and added hardware integration/enablement and uninstallation documentation.
Autoscaling: Started integration with llm-d-inference-simulator, added installation scripts for 0.4, is refactoring the VariantAutoscaling CRD, added TLS verification and logging for production, and has proposals open for comments on scale-to/from-zero and metrics for scheduler-to-autoscaler information exchange.
Observability: The Tracing SDK was merged into GIE, documentation for enabling tracing in vLLM & EPP was added, dashboards were updated, and EPP can now register metrics from extensions.
vLLM: vLLM completes the V0 deprecation and unifies on the stable V1 OpenAI-compatible APIs (token-ID returns, extra_body/metadata), while strengthening observability (StatLogger, tracing) and execution (async scheduling, DCP, torch.compile, CUDA Graphs). Expanded model, multimodal, and backend coverage (TPU PyTorch/JAX, Intel XPU, ROCm), FP8/DeepGEMM/Blackwell work, batch-invariant determinism, zero-reload Sleep Mode, and modernized docs make vLLM a production-ready, high-performance choice for heterogeneous AI inference.
Llama Stack (Inference): Recent releases v0.3.1 and v0.3.0 have introduced several key improvements, including stable OpenAI-compatible APIs, a clear separation of APIs into stable, experimental, and deprecated categories, and support for extra_body/metadata in APIs to offer more functionality than the standard OpenAI implementation. Additionally, the documentation has been significantly overhauled, now utilizing Docusaurus for modern formatting and improved API documentation. These releases were made possible by over 30 contributors, including 8 new contributors.
Community Outreach
Visit Red Hat at KubeCon + CloudNativeCon North America in Atlanta in the week of Nov 10th.
Visit the Red Hat booth #100 onsite in the sponsor showcase.
Check out all the sessions and events from Red Hat.
vLLM community
LLM Compressor surpasses 30,000 weekly installs, with 34,731 last week.
Two vLLM office hours:
Brought vLLM office hours to China, with 10,700 people attending the first session
Meetups:
Beijing vLLM meetup, with 41,000 virtual and 352 in-person attendees.
Tokyo vLLM meetup held in October.
Upcoming meetups in November and December:
Zurich: 300 registered; the event will be livestreamed here
Paris: registration just opened, aiming for 300 attendees
Bangkok, Thailand: 180 registered
Hyderabad, India: registration just opened
Malaysia: planning for early December
San Diego: vLLM party during NeurIPS
llm-d community
The community has been active at recent conferences, presenting talks at AMD Dev Day and PyTorch Conf in San Francisco.
Feedback needed for v0.4 release: As we progress toward our v0.4 release, we are refining our “well-lit path” configurations. We are seeking community feedback on the new Kustomize-based method for deploying the “Wide-EP LWS” pattern and how it compares to the existing Helmfiles-based deployment.
Upcoming talks: Look out for many new llm-d presentations at the upcoming Cloud Native + Kubernetes AI Day and KubeCon. For a full schedule, see the community events page.
WG Serving
Communications: #wg-serving channel in Kubernetes Slack and mailing list
Workstreams and subprojects updates (note: you’ll need to join the WG Serving mailing list to access some of the documents)
Overall organization updates:
Check out our KubeCon session next week from the WG co-chairs: Navigating the Rapid Evolution of Large Model Inference: Where does Kubernetes Fit?
Subprojects updates:
Gateway API Inference Extension (GIE):
Release v1.1.0 is out, with a focus on sharing experimental features and enabling users to try them:
Flow Control is available as an experimental feature.
Multi-port support is available with gateway implementations that also support it, enabling sophisticated features like Wide EP. Support from additional gateway providers is forthcoming.
The API surface has been extended with experimental multi-cluster support.
Serving Catalog:
Added LMCache component for KV cache offloading
Added GCSFuse components for efficient model weight loading
Adding support for AKS & EKS templates
Inference Perf:
Working on release v1.0.0
Adding support for synthetic multiturn chat
Dataset trace replay support
CI/CD and testing
KServe
Communications: #kserve channel in the CNCF Slack
KServe will be featured in a keynote session at KubeCon Atlanta and a dedicated session on Cloud Native AI Day. The CNCF announcement blog about KServe joining CNCF as an incubating project is scheduled to be posted during KubeCon.
Community PRs worth paying attention to:
Time-Series Forecast API
Adds a new /v1/timeseries/forecast endpoint for time-series inference (univariate/multivariate).
Supports quantiles, schema validation, and metadata.
Makes KServe ready for time-series ML workloads (e.g., IoT, finance); see the request sketch after this list.
Progressive Rollout for Raw Deployment
Introduces rollout strategies (Availability / ResourceAware) for model updates.
Simplifies deployment configuration compared to raw Kubernetes fields (maxSurge, maxUnavailable).
Improves control over model rollout safety and uptime.
Refactoring of LLMISvc Manifests
Separates LLMISvc manifests and kustomizations from the default KServe setup.
Enables modular installs (LLMISvc-only or full KServe).
Improves maintainability and clarity in installation flows.
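To make the new forecast endpoint concrete, here is a minimal request sketch in Python. It assumes a KServe predictor reachable at localhost:8080, and the payload fields (series, horizon, quantiles) are illustrative only; the authoritative request/response schema and validation rules are defined in the PR.

```python
import requests

BASE_URL = "http://localhost:8080"  # assumed local KServe predictor

# Hypothetical payload: a single univariate series plus forecast options.
# Field names here are illustrative, not the documented schema.
payload = {
    "series": [
        {
            "name": "cpu_utilization",
            "timestamps": ["2025-11-01T00:00:00Z", "2025-11-01T01:00:00Z"],
            "values": [0.42, 0.57],
        }
    ],
    "horizon": 24,                 # number of future steps to forecast
    "quantiles": [0.1, 0.5, 0.9],  # quantile forecasts, as supported by the PR
}

resp = requests.post(f"{BASE_URL}/v1/timeseries/forecast", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # expected to contain point and/or quantile forecasts
```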
llm-d
Communications: Slack, mailing list, and community meeting
SIG updates (note: you’ll need to join the llm-d mailing list to access the documents):
Inference Scheduler:
Moved sidecar code to inference-scheduler repo
EPP as a Standalone Request Scheduler proposal is open for comments
Extension to the existing pluggable framework proposal is open for comments
Proposal for serving online batch via inference gateway is open for comments
Benchmarking:
Integrated with WVA (Workload Variant Autoscaler)
Configuration Explorer is available for evaluation
Removed support for fmperf and added support for InferenceMAX
Capacity planner integrated into standup.sh
PD Disaggregation:
Wide EP updates:
Fixed issue with CUDA_VISIBLE_DEVICES
Enabled PD transfers with Prefill TP to Decode TP
Added basic single node dual batch overlap test
All2All support is new in PyTorch 2.9; hybrid DeepEP All2All kernels are in progress
GB200 NVL72 is in progress
Elastic EP milestone 2 PR
llm-d 0.4 Pareto chart automation for InferenceMax Comparison
KV Disaggregation:
Plan to add both LMCache and native CPU offloading and evaluation
Proposal to add a well-lit path for KV cache storage
RFC on llm-d native storage connector is open for comments
Proposal to add CPU medium support in inference-scheduler
RFC on cache hit threshold to handle preemptions in PD-Disaggregation and enable lightweight yet powerful P/D implementations
Add Valkey and RDMA support for KV-cache indexing
Installation:
Modularize reusable recipes for user guides
Adding an umbrella folder structure for the KV cache offloading well-lit path
Verified AKS well-lit path
Added hardware integration and enablement docs and uninstallation docs
Autoscaling:
Started working on the integration with llm-d-inference-simulator
Added installation scripts and a detailed guide for 0.4
Refactoring the VariantAutoscaling CRD from a multi-variant to a single-variant architecture
Added proper TLS verification and logging levels for production
Scale-to/from zero implementation proposal is open for comments
Proposal for metrics for scheduler-to-autoscaler information exchange is open for comments
Observability:
The Tracing SDK has been merged into GIE
Added documentation on how to enable tracing in vLLM & EPP
Updated and improved dashboards
Allow EPP to register metrics from extensions
vLLM
Communications: Slack
Articles and blog posts
vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU
No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL
From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA
Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2’s Tool-Calling on vLLM
Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM
Notable additions since the last update:
Engine and APIs
Removed remaining V0 usage flags to complete the V0 deprecation and unify on V1
Refactored to unify shared logic between Chat Completions and Responses
Enabled StatLogger in `LLMEngine` for better observability
Removed deprecated CLI options `--rope-scaling` and `--rope-theta`.
Scheduling and execution
AsyncScheduling: prevent scheduling past per-request `max_tokens`.
Perf: avoid separate thread for MP executor SHM spin
Fix incorrect preallocated `sampled_token_ids` tensor size
Decode Context Parallel (DCP) robustness
Checked `return_lse` across all layers.
Fixed `dcp_local_seq_lens` calculation.
Addressed DCP assert on `reorder_batch_threshold`.
Reasoning parsers and structured outputs
Added Kimi reasoning parser.
Fixed DeepSeek-R1 ReasoningParser import.
Reduced logging noise for the MiniMax-M2 ToolParser import success message.
Lazy-loaded `reasoning_parser` to reduce startup overhead.
Models and multimodal
Granite Speech support and LoRA for STT.
Added support for `openPangu_Ultra_MoE`.
Nemotron-H: optimal Triton fused MoE configs and pipeline parallel fix.
Qwen3-Next: MoE configs for A100-SXM4-80GB TP4/TP8.
Transformers backend: fixed encoder-only model support.
Multimodal: made `MediaConnector` extensible.
Backends and hardware
Intel XPU: IPEX custom routing functions for Llama4.
Intel XPU: added GPT-OSS model support for Intel GPU.
ROCm: upstreamed `gemm_a16w16`; redesigned ROCm AITER MHA backend for perf.
CUTLASS/SM100: swapAB optimization for FP8 GEMM; fixed FP8 FusedMoE scaling factors.
GDN path: decoupled projections from custom op; later reverted.
Graph/Kernel/Weights
Graph partition/cache: use Inductor partition ops config.
Kernel code isolation for FusedMoE method base.
Support using `Int4PreshuffledTensor` after loading.
Batch-invariant deterministic inference (see the verification sketch after this list)
Kernel override determinism foundation
FlashInfer backend kernel override (unrevert)
DeepSeek-V3 batch invariant on 8×H100
DeepGEMM + Blackwell support
Batch invariant for R1 TP8 on Blackwell
Torch compile & CUDA Graphs support
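As a rough illustration of what the batch-invariant determinism work above targets, the sketch below generates the same prompt both alone and inside a larger batch and compares the sampled token IDs, using the standard vLLM offline API with greedy decoding. The model name is a placeholder, and enabling the batch-invariant kernel overrides themselves is governed by the PRs listed above, not by anything in this snippet.

```python
from vllm import LLM, SamplingParams

# Placeholder model; any model served by vLLM works for this check.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Greedy decoding so any divergence comes from batch-size-dependent
# kernel numerics rather than sampling randomness.
params = SamplingParams(temperature=0.0, max_tokens=64)

prompt = "Explain KV cache offloading in one paragraph."
filler = ["Write a haiku about GPUs."] * 7  # pad the batch to a different size

solo = llm.generate([prompt], params)[0].outputs[0].token_ids
batched = llm.generate([prompt] + filler, params)[0].outputs[0].token_ids

# With batch-invariant kernels, the two token sequences should match exactly.
print("batch-invariant:", list(solo) == list(batched))
```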
Llama Stack (Inference)
Communications:
Community office hours happen weekly Thursdays at 12pm EST on Discord
Recent releases (changelog): v0.3.1 and v0.3.0, by 30+ contributors, including 8 new contributors. Notable changes related to inference include:
Stable OpenAI-Compatible APIs
Llama Stack now separates APIs into stable (/v1/), experimental (/v1alpha/ and /v1beta/), and deprecated (`deprecated = True`).
extra_body/metadata support for APIs that offer extra functionality compared to the standard OpenAI implementation (see the sketch after this list)
Documentation overhaul: Migration to Docusaurus, modern formatting, and improved API docs
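Because the stable endpoints are OpenAI-compatible, the standard openai Python client can be pointed at a Llama Stack server, with extra_body carrying the additional fields mentioned above. The sketch below assumes a local server on port 8321 serving the stable /v1/ APIs and uses an illustrative model id and extra_body payload; the fields actually accepted depend on the API and the configured provider.

```python
from openai import OpenAI

# Assumed local Llama Stack server exposing the stable OpenAI-compatible /v1/ APIs.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

resp = client.chat.completions.create(
    model="llama3.2:3b",  # placeholder model id registered with the stack
    messages=[{"role": "user", "content": "Summarize this newsletter in one line."}],
    # extra_body carries Llama-Stack-specific fields beyond the OpenAI spec;
    # the keys shown here are illustrative, not a documented contract.
    extra_body={"metadata": {"request_source": "newsletter-demo"}},
)
print(resp.choices[0].message.content)
```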

