State of the Model Serving Communities - January 2026
Most recent updates from several AI/ML model inference communities that our teams at Red Hat AI are contributing to.
Hi everyone,
Happy New Year everyone! We hope you enjoy the holidays. This newsletter will provide you with recent updates on various model serving communities, keep you informed about Red Hat AI’s contributions to upstream communities, and foster collaborations across teams and organizations.
In case you missed it, last year we launched this newsletter publicly on Substack and now we have over 1,200 subscribers already! Feel free to share with others who might be interested in receiving future updates.
Contributors: Nir Rozenbaum, Sasa Zelenovic, Pete Cheslock, Wentao Ye, Yuan Tang
Executive Summary
Community Outreach:
vLLM: Shared the “vLLM 2025 Retrospective & 2026 Roadmap” with video and slides. Launched a new website, vllm.ai, with an events calendar. Upcoming weekly vLLM Office Hours covers topics like batch invariant, Speculators, LLM Compressor, and CPU offloading.
llm-d: Created a new page for conference talks. A couple of sessions are accepted for the upcoming KubeCon EU. The first llm-d meetup will be in March in NYC.
WG Serving:
GIE: Released v1.3.0 rc2 with major enhancements to stabilize and evolve flow control, re-add model rewrite via a new CRD, implement a latency predictor as an experimental scorer, and add DAG verification for plugins. Introduced body-based routing to manage multiple inference pools seamlessly.
Serving Catalog: Added the tpu7x machine type. Azure support for LLaMA3-8B is currently in progress.
Inference Perf: Fixed vLLM prefix metrics. Added mTLS support to the vLLM client. Implemented percentiles configuration for request lifecycle metrics reporting.
KServe: We are bumping the Gateway API Inference Extension (GIE) version to v1.2.0 and updating the LLMISVC group from v1alpha1 to v1alpha2. We should expect a new release once this is merged.
llm-d: Updates from various SIGs from the llm-d community:
Inference Scheduler: Released v0.4, working on v0.5, completed data-parallel aware scheduling, merged support for multiple InferencePools, and is implementing autoscaler scale-to/from-zero support. Several talks were accepted for KubeCon EU Amsterdam 2026.
Benchmarking: Released v0.4, submitted a “simple benchmarking” guide, supports inferenceMAX, and is preparing integration documentation with LLM-D guides and SIG-Installation.
PD Disaggregation: Released vLLM 0.12.0, working on WideEP on GB200 deployment, integrating DP-Aware Scheduling into WideEP guides, and addressing setup issues for vLLM GB200 WideEP.
KV Disaggregation: Renamed the repository to llm-d-kv-cache, released v0.4.0, is working on v0.5.0 (focusing on Multi-Modal/Multi-Model/LoRAs support, Disaggregated tokenization, and Active-active HA), merged an enhanced CPU offload PR, and landed LoRA support in precise-scheduling.
Installation: Continuing the Helm to Kustomize refactor, working on integration with SIG-Benchmarking, and upgraded to Istio 1.28.1.
Autoscaling: Released the Workload Variant Autoscaler (WVA) as an experimental feature in v0.4.0, made architectural changes for scale-to/from-zero support, and is working on WVA integration with the LLM-D benchmark.
Observability: Discussed the need for more metrics (e.g., exposing EPLB state, tracking DBO fallout reasons), and is preparing documentation for tracing, which was previously merged in GAIE.
vLLM: vLLM’s default CUDA build now targets PyTorch 2.9.0 + CUDA 12.9.1, with FlashInfer v0.5.2 as a default dependency and broader batch‑invariant torch.compile support (incl. Hopper/Blackwell). It also improves async scheduling robustness, adds MoE Shared Expert Overlap (~4% E2E on DeepSeek-style models), expands model/Blackwell optimizations, and introduces Anthropic /v1/messages API support.
Llama Stack: The recent releases (v0.4.0, v0.3.5, v0.3.4) of llama-stack introduced several key changes. Notable new features and improvements include architectural enhancements (API/provider separation, FastAPI migration), significant Vector Store enhancements (query rewrite support, hybrid/keyword search for Qdrant and ChromaDB, persistence, and returning metadata/embeddings), new providers for inference, etc.
Community Outreach
vLLM community:
Shared “vLLM 2025 Retrospective & 2026 Roadmap”. Check it out to see what happened with vLLM in 2025 and what we have in store for 2026 (Video, Slides)
Launched a new website: vllm.ai for the community to stay current with all vLLM events and community happenings. Check out our new events calendar to see upcoming office hours, meetups, and more.
vLLM Office Hours are happening every week in January; register here.
Jan 8: Intro to batch invariant in vLLM
Jan 15: Intro to Speculators, a unified library for building and storing speculative decoding algorithms for LLMs with vLLM
Jan 22: LLM Compressor update
Jan 29: Deep Dive into the vLLM CPU offloading connector
We’ve seen high usage of our optimized LLMs in 2025; check them out on our Hugging Face repo to achieve more performance and efficiency. If you want to learn how we optimize LLMs using the LLM Compressor, and how you can do the same, join our vLLM Office Hours on January 22, 2026, where we’ll share what’s new with LLM Compressor including Attention and KV Cache Quantization, Model-Free PTQ, AutoRound, and MXFP4.
llm-d community:
We created a new videos page to highlight all the llm-d related conference talks at recent events like Pytorch Conf and KubeCon: https://llm-d.ai/videos
Join KubeCon EU as llm-d will be highlighted in various talks at the event. To find other events where llm-d contributors are presenting, visit the llm-d events page: https://llm-d.ai/docs/community/events
Subscribe to the llm-d Youtube channel for not only community meeting recordings, but also to watch technical demos from the community - such as this demo showing 90%+ KV Cache hit rates by using llm-d to scale inference.
SAVE THE DATE: For the first llm-d meetup happening Wednesday March 11, 2026 at One Madison in NYC. Follow llm-d on our social media when registration opens for this free event (Twitter/X, LinkedIn, Bluesky).
WG Serving
Communications
#wg-serving channel in Kubernetes Slack and mailing list
Subprojects updates (note: you’ll need to join the WG Serving mailing list to access some of the documents)
Gateway API Inference Extension (GIE):
Release v1.3.0 RC2 is out.
Extra work has been put to stabilize and evolve flow control (queuing mechanism when the system is saturated).
Model rewrite was added back using an additional CRD.
The latency predictor is implemented as an experimental scorer.
Added DAG verification for plugins (no option to create cyclic dependencies of plugins).
Starting from this version, BBR (body-based routing) can be used to seamlessly manage multiple inference pools, each for a different set of base+adapters, while keeping the same UX from the user perspective.
Talks accepted to KubeCon EU 2026:
Added tpu7x machine type
Azure support for LLaMA3-8B is in-progress
Fixed vLLM prefix metrics
Added mTLS support in the vLLM client
Added percentiles configuration for request lifecycle metrics reporting
KServe
Communications: #kserve channel in the CNCF Slack
PR #4886 bumps the Gateway API Inference Extension (GIE) version to v1.2.0 and updates the LLMISVC group from v1alpha1 to v1alpha2. We should expect a new release once this is merged.
llm-d
Communications
SIG updates (note: you’ll need to join the llm-d mailing list to access the documents):
Releases and Roadmaps: Released llm-d Inference Scheduler v0.4 and are actively working on the v0.5 roadmap. RC1 of IGW 1.3 is out.
Scheduling Features: Data-parallel aware scheduling with P/D support is complete. Verified DP-aware scheduling on a real cluster. Merged support for multiple InferencePools through ConfigMap and BBR. Exploring score-based strategies for LoRA-aware fairness scheduling
Autoscaling Support: Working on implementation and documentation for support of autoscaler scale-to-/from-zero.
Community: Several talks were accepted for KubeCon EU Amsterdam 2026.
Releases and Documentation: A v0.4 release happened in the third week of December. A “simple benchmarking” guide for all well-lit paths was submitted as a PR.
Features: inferenceMAX is a supported harness. Working on benchmark-integrated Wide-EP and better support for pre-loaded models.
Integration: Preparing integration documentation for llm-d guides. Working on a joint workstream with SIG-Installation for direct integration into guides.
Releases and Features: vLLM 0.12.0 was released. Working on WideEP on GB200 deployment and integrating DP-Aware Scheduling into the WideEP guides.
Wide EP: Continuing work on WideEP NVL72 and Elastic EP. Resolving setup issues for vLLM GB200 WideEP.
Optimizations: P/D scheduler fixes landed, and EPLB performance optimizations are in progress.
Repository: The repository was renamed to llm-d-kv-cache.
Releases and Roadmap: v0.4.0 was released, and work on the v0.5.0 roadmap is in progress, focusing on Multi-Modal/Multi-Model/LoRAs support, Disaggregated tokenization, and Active-active HA.
CPU Offloading & Storage: Enhanced CPU offload PR merged with significant performance improvements. llm-d Storage connector is landing next week. Storage connector PRs (like PVC Evictor and Active-active HA) are progressing.
Scheduling: LoRA Support in precise-scheduling landed, and there are ongoing efforts for Multi-Modal precise prefix scheduling (vision).
Refactor: Continuing work on the Helm to Kustomize refactor and fixing lagging guide updates.
Integration: Working on a joint workstream with SIG-Benchmarking for direct integration of benchmarking into guides. Upgraded to Istio 1.28.1 for consuming the new v1 inference.networking.k8s.io infernecepool API.
Releases and Features: The Workload Variant Autoscaler (WVA) was released as an experimental feature in llm-d v0.4.0.
Scale-to-/from-Zero: Architectural changes now support scale to and from zero and running multiple optimizers. Progress on the reconciler split and scale from zero implementation is ongoing.
Integration: Integration of WVA 0.4.2 with the llm-d benchmark is in progress. Working on a plugin-based mechanism for different data sources for scale-from-zero.
Metrics Proposals: Discussions on Slack and a detailed document were created on the need for more metrics. Key proposals for vLLM include exposing EPLB state as Prometheus metrics, tracking DBO fallout reasons, and adding opt-in per-layer and per-expert metrics.
Tracing: The Tracing proposal is due for another review. Tracing was previously merged in GAIE and documentation is being prepared.
vLLM
Communications: Slack
Articles and blog posts
Tracing Hanging and Complicated GPU Kernels Down To The Source Code
Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor
Diving into speculative decoding training support for vLLM with Speculators v0.3.0
vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving
Token-Level Truth: Real-Time Hallucination Detection for Production LLMs
Run Highly Efficient and Accurate AI Agents with NVIDIA Nemotron 3 Nano on vLLM
Encoder Disaggregation for Scalable Multimodal Model Serving
AMD × vLLM Semantic Router: Building the System Intelligence Together
vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP
Notable additions since the last update:
Default to PyTorch 2.9.0 + CUDA 12.9.1: Default CUDA build now targets torch==2.9.0+cu129, enabling Inductor partitioning, and landing multiple fixes in graph-partition rules and compile-cache integration.
FlashInfer v0.5.2 as Default CUDA Dependency: Upgraded FlashInfer to v0.5.2 and made it a default CUDA dependency, with new support for CUDNN FP4 GEMM and FP8 blockscale on SM90.
Batch-invariant torch.compile: Generalized batch-invariant support across attention and MoE backends. Explicit support for DeepGEMM and FlashInfer on Hopper and Blackwell GPUs.
Robust async scheduling: Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, and DeepEP / DCP. We expect --async-scheduling to be enabled by default soon.
Shared Expert Overlap: Added concurrent execution of shared experts and selected experts in MoE layers with TP + EP, yielding 4% E2E improvement on DeepSeek-style models.
New Model Support: Added support for DeepSeek-V3.2, MiniMax-M2, Kimi Linear, DeepSeek-OCR, PaddleOCR-VL, LightOnOCR, Siglip2, FlexOlmo, NemotronH MoE, Qwen3-Omni MoE Thinker, and many more models.
Blackwell (SM100+) Optimizations: Multiple fixes and optimizations for NVIDIA Blackwell GPUs including INT8 quantization, CUTLASS FP8 GEMM improvements, and automatic PIECEWISE cudagraph mode for long contexts.
Stronger scheduler + KV connector ecosystem: Improved test coverage in CI and made scheduler behavior more robust with KV connectors, prefix caching, and multi-node deployments.
Anthropic API Support: Added support for the /v1/messages endpoint, allowing users to interact with vllm serve using Anthropic-compatible clients, including Claude Code.
Llama Stack
Communications
Community office hours happen weekly Thursdays at 12pm EST on Discord
Recent releases (full list): v0.4.0, v0.3.5, v0.3.4. Notable changes include:
New features and improvements:
Architectural improvements such as API/provider separation, FastAPI migration, and Inspect API’s improved default behavior.
Vector Store enhancements, including:
Query Rewrite Support: Added query rewrite capabilities in vector_store.search
Qdrant Improvements: Hybrid and keyword search support
ChromaDB Enhancements: Keyword search and delete_chunk implementation
Persistence: Vector stores now persist across server restarts
Metadata & Embeddings: Return embeddings and metadata from vector store methods
Model Discovery: List available models via provider_data header
Standardized Configuration and new providers for model inference
New APIs: Read-only Connectors API, File Processor API skeleton, and Admin API.
Please check out v0.4.0 release notes for additional new features.
Bug fixes:
RBAC bypass vulnerabilities in model access
Respect table_name config in InferenceStore
InferenceStore workers being cancelled on event loop change
Other notable changes:
Breaking changes in configuration files, API removals, VectorStore API renames, and API behavior changes. Please check out v0.4.0 release notes for details.


Thanks for the detailed update.
The focus on Scale-to-Zero support in WVA is awesoem! what the industry needs right now.
We’ve been tackling the runtime side of this challenge (achieving <2s loading via memory tiering) to make that autoscaling logic viable for latency-sensitive workloads. We actually just shared our operator architecture with Rob Shaw to see how it might align with these new llm-d patterns.