State of the Model Serving Communities - November 2025
Most recent updates from several AI/ML model inference communities that our teams at Red Hat AI are contributing to.
Hi everyone,
This newsletter will provide you with recent updates on various model serving communities, keep you informed about Red Hat AI’s contributions to upstream communities, and foster collaborations across teams and organizations.
In case you missed it, we launched this newsletter publicly on Substack last month, and we already have 800+ subscribers! Feel free to share it with others who might be interested in receiving future updates.
Contributors: Pete Cheslock, Sasa Zelenovic, Wentao Ye, Nick Hill, Nir Rozenbaum, Jooho Lee, Yuan Tang
Executive Summary
Community Outreach: Red Hat will be at KubeCon + CloudNativeCon North America in Atlanta the week of November 10th at booth #100. The vLLM community has seen significant activity, with LLM Compressor surpassing 30,000 weekly installs, successful office hours, and numerous meetups globally, including upcoming events in Zurich, Paris, Bangkok, Hyderabad, Malaysia, and San Diego. The llm-d community has presented at AMD Dev Day and PyTorch Conf, is seeking feedback for its v0.4 release, and has upcoming talks at Cloud Native + Kubernetes AI Day and KubeCon.
WG Serving: The WG co-chairs will deliver a KubeCon session titled “Navigating the Rapid Evolution of Large Model Inference: Where does Kubernetes Fit?”. GIE released v1.1.0 with experimental features such as Flow Control, multi-port support, and multi-cluster support. Serving Catalog added LMCache and GCSFuse components and is adding support for AKS & EKS templates. The Inference Perf team is working on release v1.0.0, which will include synthetic multi-turn chat support, dataset trace replay support, and CI/CD and testing improvements.
KServe: KServe will be featured in a keynote session at KubeCon Atlanta and a dedicated session on Cloud Native AI Day. A CNCF announcement blog about KServe joining CNCF as an incubating project is scheduled to be posted during KubeCon. Notable community PRs include the addition of a new time-series forecast API, progressive rollout for raw deployments, and a refactoring of the LLMISvc manifests to enable modular installs and improve maintainability and clarity.
llm-d: Updates from various SIGs from the llm-d community:
Inference Scheduler: Moved the sidecar code to the inference-scheduler repo, and has proposals open for comments on EPP as a Standalone Request Scheduler, an extension to the pluggable framework, and serving online batch via the inference gateway.
Benchmarking: Integrated with WVA, offers a Configuration Explorer for evaluation, replaced fmperf support with InferenceMAX support, and integrated the capacity planner into standup.sh.
PD Disaggregation: Updates on Wide EP (fixed CUDA_VISIBLE_DEVICES issue, enabled PD transfers, added a basic test, in-progress All2All kernels and GB200 NVL72), Elastic EP milestone 2 PR, and llm-d 0.4 Pareto chart automation.
KV Disaggregation: Plans to add LMCache and native CPU offloading, proposals for KV cache storage and CPU medium support, an RFC on llm-d native storage connector, and updates on Valkey, RDMA, gRPC Service, and UDS-based external tokenizer service.
Installation: Modularized reusable recipes, added umbrella KV cache offloading folder structure, verified AKS well-lit path, and added hardware integration/enablement and uninstallation documentation.
Autoscaling: Started integration with llm-d-inference-simulator, added installation scripts for 0.4, is refactoring the VariantAutoscaling CRD, added TLS verification and logging for production, and has proposals open for comments on scale-to/from-zero and metrics for scheduler-to-autoscaler information exchange.
Observability: The Tracing SDK was merged into GIE, documentation for enabling tracing in vLLM & EPP was added, dashboards were updated, and EPP can now register metrics from extensions.
vLLM: vLLM completes the V0 deprecation and unifies on the stable V1 OpenAI-compatible APIs (token-ID returns, extra_body/metadata), while strengthening observability (StatLogger, tracing) and execution (async scheduling, DCP, torch.compile, CUDA Graphs). Expanded model, multimodal, and backend coverage (TPU PyTorch/JAX, Intel XPU, ROCm), FP8/DeepGEMM/Blackwell work, batch-invariant determinism, zero-reload Sleep Mode, and modernized docs make vLLM a production-ready, high-performance choice for heterogeneous AI inference.
Llama Stack (Inference): Recent releases v0.3.1 and v0.3.0 have introduced several key improvements, including stable OpenAI-compatible APIs, a clear separation of APIs into stable, experimental, and deprecated categories, and support for extra_body/metadata in APIs to offer more functionality than the standard OpenAI implementation. Additionally, the documentation has been significantly overhauled, now utilizing Docusaurus for modern formatting and improved API documentation. These releases were made possible by over 30 contributors, including 8 new contributors.
Community Outreach
Visit Red Hat at KubeCon + CloudNativeCon North America in Atlanta in the week of Nov 10th.
Visit the Red Hat booth #100 onsite in the sponsor showcase.
Check out all the sessions and events from Red Hat.
vLLM community
LLM Compressor surpasses 30,000 weekly installs, with 34,731 last week.
Two vLLM office hours:
Brought vLLM office hours to China, with 10,700 people attending the first session
Meetups:
Beijing vLLM meetup, with 41,000 virtual and 352 in-person attendees.
Tokyo vLLM meetup held in October.
Upcoming meetups in November and December:
Zurich: 300 registered; the event will be livestreamed here
Paris: registration just opened, aiming for 300 attendees
Bangkok, Thailand: 180 registered
Hyderabad, India: registration just opened
Malaysia: planning for early December
San Diego: vLLM party during NeurIPS
llm-d community
The community has been active at recent conferences, presenting talks at AMD Dev Day and PyTorch Conf in San Francisco.
Feedback needed for v0.4 release: As we progress toward our v0.4 release, we are refining our “well-lit path” configurations. We are seeking community feedback on the new Kustomize-based method for deploying the “Wide-EP LWS” pattern and how it compares to the existing Helmfiles-based deployment.
Upcoming talks: Look out for many new llm-d presentations at the upcoming Cloud Native + Kubernetes AI Day and KubeCon. For a full schedule, see the community events page.
WG Serving
Communications: #wg-serving channel in Kubernetes Slack and mailing list
Workstreams and subprojects updates (note: you’ll need to join the WG Serving mailing list to access some of the documents)
Overall organization updates:
Check out our KubeCon session next week from the WG co-chairs: Navigating the Rapid Evolution of Large Model Inference: Where does Kubernetes Fit?
Subprojects updates:
Gateway API Inference Extension (GIE):
Release v1.1.0 is out, with a focus on sharing experimental features and enabling users to try them:
Flow Control is available as an experimental feature.
Multi-port support is available with gateway implementations that also support it, enabling sophisticated features like Wide EP. Support from additional gateway providers is forthcoming.
The API surface has been extended with experimental multi-cluster support.
Serving Catalog:
Added LMCache component for KV cache offloading
Added GCSFuse components for efficient model weight loading
Adding support for AKS & EKS templates
Inference Perf:
Working on release v1.0.0
Adding support for synthetic multiturn chat
Dataset trace replay support
CI/CD and testing
KServe
Communications: #kserve channel in the CNCF Slack
KServe will be featured in a keynote session at KubeCon Atlanta and a dedicated session on Cloud Native AI Day. The CNCF announcement blog about KServe joining CNCF as an incubating project is scheduled to be posted during KubeCon.
Community PRs worth paying attention to:
Time-Series Forecast API
Adds a new /v1/timeseries/forecast endpoint for time-series inference (univariate/multivariate).
Supports quantiles, schema validation, and metadata.
Makes KServe ready for time-series ML workloads (e.g., IoT, finance); see the request sketch after this list.
Progressive Rollout for Raw Deployment
Introduces rollout strategies (Availability / ResourceAware) for model updates.
Simplifies deployment configuration compared to raw Kubernetes fields (maxSurge, maxUnavailable).
Improves control over model rollout safety and uptime.
Refactoring of LLMISvc Manifests
Separates LLMISvc manifests and kustomizations from the default KServe setup.
Enables modular installs (LLMISvc-only or full KServe).
Improves maintainability and clarity in installation flows.
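To make the new forecast endpoint concrete, here is a minimal request sketch in Python. It assumes a KServe predictor reachable at localhost:8080, and the payload fields (series, horizon, quantiles) are illustrative only; the authoritative request/response schema and validation rules are defined in the PR.

```python
import requests

BASE_URL = "http://localhost:8080"  # assumed local KServe predictor

# Hypothetical payload: a single univariate series plus forecast options.
# Field names here are illustrative, not the documented schema.
payload = {
    "series": [
        {
            "name": "cpu_utilization",
            "timestamps": ["2025-11-01T00:00:00Z", "2025-11-01T01:00:00Z"],
            "values": [0.42, 0.57],
        }
    ],
    "horizon": 24,                 # number of future steps to forecast
    "quantiles": [0.1, 0.5, 0.9],  # quantile forecasts, as supported by the PR
}

resp = requests.post(f"{BASE_URL}/v1/timeseries/forecast", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # expected to contain point and/or quantile forecasts
```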
llm-d
Communications: Slack, mailing list, and community meeting
SIG updates (note: you’ll need to join the llm-d mailing list to access the documents):
Inference Scheduler:
Moved sidecar code to inference-scheduler repo
EPP as a Standalone Request Scheduler proposal is open for comments
Extension to the existing pluggable framework proposal is open for comments
Proposal for serving online batch via inference gateway is open for comments
Benchmarking:
Integrated with WVA (Workload Variant Autoscaler)
Configuration Explorer is available for evaluation
Removed support for fmperf and added support for InferenceMAX
Capacity planner integrated into standup.sh
PD Disaggregation:
Wide EP updates:
Fixed issue with CUDA_VISIBLE_DEVICES
Enabled PD transfers with Prefill TP to Decode TP
Added basic single node dual batch overlap test
All2All support is new in PyTorch 2.9; hybrid DeepEP All2All kernels are in progress
GB200 NVL72 is in progress
Elastic EP milestone 2 PR
llm-d 0.4 Pareto chart automation for InferenceMax Comparison
KV Disaggregation:
Plan to add both LMCache and native CPU offloading and evaluation
Proposal to add a well-lit path for KV cache storage
RFC on llm-d native storage connector is open for comments
Proposal to add CPU medium support in inference-scheduler
RFC on cache hit threshold to handle preemptions in PD-Disaggregation and enable lightweight yet powerful P/D implementations
Add Valkey and RDMA support for KV-cache indexing
Installation:
Modularize reusable recipes for user guides
Adding an umbrella folder structure for the KV cache offloading well-lit path
Verified AKS well-lit path
Added hardware integration and enablement docs and uninstallation docs
Autoscaling:
Started working on the integration with llm-d-inference-simulator
Added installation scripts and a detailed guide for 0.4
Refactoring the VariantAutoscaling CRD from a multi-variant to a single-variant architecture
Added proper TLS verification and logging levels for production
Scale-to/from zero implementation proposal is open for comments
Proposal for metrics for scheduler-to-autoscaler information exchange is open for comments
Observability:
The Tracing SDK has been merged into GIE
Added documentation on how to enable tracing in vLLM & EPP
Updated and improved dashboards
Allow EPP to register metrics from extensions
vLLM
Communications: Slack
Articles and blog posts
vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU
No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL
From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA
Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2’s Tool-Calling on vLLM
Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM
Notable additions since the last update:
Engine and APIs
Removed remaining V0 usage flags to complete the V0 deprecation and unify on V1
Refactored to unify shared logic between Chat Completions and Responses
Enabled StatLogger in `LLMEngine` for better observability
Removed deprecated CLI options `--rope-scaling` and `--rope-theta`.
Scheduling and execution
AsyncScheduling: prevent scheduling past per-request `max_tokens`.
Perf: avoid separate thread for MP executor SHM spin
Fix incorrect preallocated `sampled_token_ids` tensor size
Decode Context Parallel (DCP) robustness
Checked `return_lse` across all layers.
Fixed `dcp_local_seq_lens` calculation.
Addressed DCP assert on `reorder_batch_threshold`.
Reasoning parsers and structured outputs
Added Kimi reasoning parser.
Fixed DeepSeek-R1 ReasoningParser import.
Reduced logging noise for the MiniMax-M2 ToolParser import success message.
Lazy-loaded `reasoning_parser` to reduce startup overhead.
Models and multimodal
Granite Speech support and LoRA for STT.
Added support for `openPangu_Ultra_MoE`.
Nemotron-H: optimal Triton fused MoE configs and pipeline parallel fix.
Qwen3-Next: MoE configs for A100-SXM4-80GB TP4/TP8.
Transformers backend: fixed encoder-only model support.
Multimodal: made `MediaConnector` extensible.
Backends and hardware
Intel XPU: IPEX custom routing functions for Llama4.
Intel XPU: added GPT-OSS model support for Intel GPU.
ROCm: upstreamed `gemm_a16w16`; redesigned ROCm AITER MHA backend for perf.
CUTLASS/SM100: swapAB optimization for FP8 GEMM; fixed FP8 FusedMoE scaling factors.
GDN path: decoupled projections from custom op; later reverted.
Graph/Kernel/Weights
Graph partition/cache: use Inductor partition ops config.
Kernel code isolation for FusedMoE method base.
Support using `Int4PreshuffledTensor` after loading.
Batch-invariant deterministic inference (see the verification sketch after this list)
Kernel override determinism foundation
FlashInfer backend kernel override (unrevert)
DeepSeek-V3 batch invariant on 8×H100
DeepGEMM + Blackwell support
Batch invariant for R1 TP8 on Blackwell
Torch compile & CUDA Graphs support
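As a rough illustration of what the batch-invariant determinism work above targets, the sketch below generates the same prompt both alone and inside a larger batch and compares the sampled token IDs, using the standard vLLM offline API with greedy decoding. The model name is a placeholder, and enabling the batch-invariant kernel overrides themselves is governed by the PRs listed above, not by anything in this snippet.

```python
from vllm import LLM, SamplingParams

# Placeholder model; any model served by vLLM works for this check.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Greedy decoding so any divergence comes from batch-size-dependent
# kernel numerics rather than sampling randomness.
params = SamplingParams(temperature=0.0, max_tokens=64)

prompt = "Explain KV cache offloading in one paragraph."
filler = ["Write a haiku about GPUs."] * 7  # pad the batch to a different size

solo = llm.generate([prompt], params)[0].outputs[0].token_ids
batched = llm.generate([prompt] + filler, params)[0].outputs[0].token_ids

# With batch-invariant kernels, the two token sequences should match exactly.
print("batch-invariant:", list(solo) == list(batched))
```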
Llama Stack (Inference)
Communications:
Community office hours happen weekly Thursdays at 12pm EST on Discord
Recent releases (changelog): v0.3.1 and v0.3.0, by 30+ contributors, including 8 new contributors. Notable changes related to inference include:
Stable OpenAI-Compatible APIs
Llama Stack now separates APIs into stable (/v1/), experimental (/v1alpha/ and /v1beta/), and deprecated (`deprecated = True`).
extra_body/metadata support for APIs that offer extra functionality compared to the standard OpenAI implementation (see the sketch after this list)
Documentation overhaul: Migration to Docusaurus, modern formatting, and improved API docs
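Because the stable endpoints are OpenAI-compatible, the standard openai Python client can be pointed at a Llama Stack server, with extra_body carrying the additional fields mentioned above. The sketch below assumes a local server on port 8321 serving the stable /v1/ APIs and uses an illustrative model id and extra_body payload; the fields actually accepted depend on the API and the configured provider.

```python
from openai import OpenAI

# Assumed local Llama Stack server exposing the stable OpenAI-compatible /v1/ APIs.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

resp = client.chat.completions.create(
    model="llama3.2:3b",  # placeholder model id registered with the stack
    messages=[{"role": "user", "content": "Summarize this newsletter in one line."}],
    # extra_body carries Llama-Stack-specific fields beyond the OpenAI spec;
    # the keys shown here are illustrative, not a documented contract.
    extra_body={"metadata": {"request_source": "newsletter-demo"}},
)
print(resp.choices[0].message.content)
```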

