State of the Model Serving Communities - October 2025
Most recent updates from several AI/ML model inference communities that our teams at Red Hat AI are contributing to.
Hi everyone,
This newsletter will provide you with recent updates on various model serving communities, keep you informed about Red Hat AI’s contributions to upstream communities, and foster collaborations across teams and organizations.
In case you missed it, we launched this newsletter publicly on Substack last month, and we already have 600+ subscribers! Feel free to share it with others who might be interested in receiving future updates.
Contributors: Nick Hill, Nir Rozenbaum, Maroon Ayoub, Greg Pereira, Jooho Lee, Carlos Costa, Sasa Zelenovic, Yuan Tang
Executive Summary
Community Outreach: Upcoming events include IBM TechXchange in October, featuring talks by Carlos Costa on LLM inference and Yuan Tang on KServe, and KubeCon North America in November, where Yuan Tang and Anjali Telang will deliver a keynote on trust in AI. The vLLM community has hosted meetups in Austin, Boston, and Toronto, with 7 new meetups planned for Q4 in various international cities. Community metrics show vLLM adoption exceeding 1 million installs per week, LLM Compressor experiencing nearly 100% month-over-month growth (with 26k installs in each of the last two weeks and over 2k GitHub stars), and vLLM contributions averaging 200-250 commits per week, with Red Hat and IBM consistently contributing 25% of all commits.
WG Serving: The Gateway API Inference Extension (GIE) released v1.0.1, focusing on stabilization, quickstart guides, CRDs, integrations, performance improvements, and Helm chart enhancements. The Serving Catalog project is reviewing AWS support for vLLM, and added LMCache setup and XLA cache for vLLM. Inference Perf released v0.2.0 with improvements in default concurrency, multi-process CPU utilization, SGLang and TGI support, new dataset support for various use cases, automatic request rate sweeping, and observability enhancements.
KServe: KServe is now an official CNCF incubating project. Key updates include the release of 0.16 rc0 with llm-d integration, a new tutorial document for KServe + llm-d + Envoy AI Gateway, and a new Rolling Strategy API with Availability and ResourceAware modes.
llm-d: Updates from the various SIGs in the llm-d community:
Inference Scheduler: A new proposal to extend the existing pluggable framework is out, and a new Kubernetes AI Gateway working group has been formed.
PD Disaggregation: Work is commencing on items for the next release, focusing on improved memory footprint for WideEP decoders, robustness and reliability for P/D, tooling for debugging the llm-d high-performance networking stack, and GB200 WideEP.
Benchmarking: Inference Perf v0.2.0 has been released with scalability improvements. The capacity planner is being enhanced. Benchmark data for key well-lit paths is available.
KV Disaggregation: vLLM-Native CPU Offloading Connector has been integrated into vLLM. Precise prefix-cache aware scheduling has been enhanced with granular GPU/CPU visibility.
Installation: Testing a composable install for v0.3, which involves creating resources like httpRoutes out of band. The wide-ep-lws example has been migrated to use only upstream GIE charts and static deployment manifests.
Autoscaling: Work is in progress to gather metrics per variant from EPP to support multiple SLO classes per model. Documentation is being developed on how to work with multiple inference pools and how llm-d provides recommendations. Scaling to and from zero support is under development, with a focus on scaling instances rather than the EPP.
Observability: A new script generates metrics and useful queries for building dashboards, and llm-d guides now default to creating Prometheus monitors for vLLM & EPP metrics.
vLLM: With the v0.11 release, the V0 engine has been removed. Progress was made on the CI overhaul for stability, speed, and cost reduction. Significant effort went into supporting the frontier OSS model releases Qwen3-Next and DeepSeek-V3.2-Exp. KV cache CPU offloading is now supported natively. Work is underway on support for deterministic inference. Async scheduling performance has improved, with a goal to enable it by default. Focus continues on “Wide EP” optimizations including Dual-Batch-Overlap (DBO) and Decode Context Parallel (DCP). Contributor and community activity growth show no signs of slowing.
Llama Stack (Inference): The Llama Stack project released v0.2.21, v0.2.22, v0.2.23, with a significant backward-incompatible version 0.3.0 planned next, which will streamline APIs towards OpenAI compatibility and remove some existing features like batch inference. These releases introduced OpenAI Prompts API, automatic default inference store during build, a write queue for the inference store, and enhancements to the Together provider for embedding and dynamic model support. There’s also improved model information by combining dynamic discovery with static embedding metadata, and standardization of Ollama and Fireworks providers with an OpenAI compatibility layer.
Community Outreach
Upcoming events
IBM TechXchange in October
KubeCon North America in November
Red Hat will deliver a day 1 keynote Anchoring Trust in the Age of AI: Identities Across Humans, Machines, and Models by Yuan Tang and Anjali Telang
Other relevant sessions are unchanged; please see last month’s updates for details.
vLLM community
Meetups
Past meetups: Austin, Boston, Toronto. 1,132 registered and 385 attended.
Planning 7 new vLLM meetups in Q4 - Tokyo, Beijing, Zurich, Frankfurt, Bangkok, Singapore, San Diego
Bi-weekly office hours (find the slides in YouTube descriptions, linked below)
Community metrics:
vLLM adoption: Growing steadily, now >1M installs per week.
LLM Compressor: nearly 100% MoM growth.
~10k weekly installs in August; ~20k in September.
26k installs in each of the last two weeks.
GitHub stars crossed 2k.
vLLM contributions:
~200 commits/week in August; ~250/week in September.
Red Hat + IBM = Steady 25% of all commits.
WG Serving
Communications: #wg-serving channel in Kubernetes Slack and mailing list
Workstream and subproject updates (note: you’ll need to join the WG Serving mailing list to access some of the documents)
Subproject updates:
Gateway API Inference Extension (GIE):
Release v1.0.1 is out. Work this month focused mainly on stabilizing GIE around the first GA release, including all quickstart guides, CRDs, integrations, and performance improvements. This took a tremendous effort from multiple organizations to get multiple Gateway implementations working with the latest release and to test it thoroughly in order to find and fix issues. The Helm charts also received many enhancements to keep them highly pluggable, so the EPP can be configured easily through the Helm deployment.
Serving Catalog:
AWS support in the vLLM example is under review
Added LMCache setup as a new component
Added XLA cache for vLLM
Inference Perf:
Release v0.2.0 is out, including the following major improvements:
Default concurrency improvement and multi-process CPU utilization improvement with extensive scale testing
Enhanced support for SGLang and TGI with model server metrics
New dataset support for summarization, prefill heavy and decode heavy use cases
Automatic sweep of request rates until saturation (a toy sketch of the idea follows after this list)
Observability improvements around load generation and the ability to monitor scheduling delay and achieved rate
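The automatic request-rate sweep is easiest to picture as a loop that keeps raising the offered load until throughput stops improving. The sketch below is a minimal illustration of that idea, not Inference Perf's actual implementation; simulate_server() is a hypothetical stand-in for a real load-generation step against a model server.

```python
# A minimal sketch, not Inference Perf's implementation: sweep the offered
# request rate upward until achieved throughput stops improving (saturation).
# simulate_server() is a hypothetical stand-in for a real load-generation step.

def simulate_server(offered_rps: float, capacity_rps: float = 120.0) -> float:
    """Pretend load step: the server keeps up until it hits its capacity."""
    return min(offered_rps, capacity_rps)

def sweep(start_rps: float = 1.0, step_factor: float = 2.0, tolerance: float = 0.05) -> float:
    """Keep multiplying the offered rate until throughput gains fall below `tolerance`."""
    offered, last_achieved = start_rps, 0.0
    while True:
        achieved = simulate_server(offered)
        if last_achieved and achieved < last_achieved * (1.0 + tolerance):
            return last_achieved  # saturation point reached
        last_achieved, offered = achieved, offered * step_factor

if __name__ == "__main__":
    print(f"saturation throughput ~= {sweep():.1f} req/s")
```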
KServe
Communications: #kserve channel in the CNCF Slack
KServe has passed the TOC vote and is now an official CNCF incubating project!
0.16 rc0 has been released, which includes llm-d integration
KServe + llm-d + Envoy AI Gateway tutorial doc
New Rolling Strategy API with the following two modes
Availability mode: new pods are launched before old ones are terminated
ResourceAware mode: old pods are terminated before new ones are launched
llm-d
Communications: Slack, mailing list, and community meeting
Community metrics: 46k installs in September (vs. 24k in August).
SIG updates (note: you’ll need to join the llm-d mailing list to access the documents):
Inference Scheduler:
A new proposal to extend the existing pluggable framework is available
A new working group AI Gateway WG has been formed
vLLM-sim: support for the /tokenize REST endpoint has been merged
Benchmarking:
Inference Perf v0.2.0 has been released with scalability improvements, the capacity planner is being enhanced, and separate analysis notebooks are now available for each.
v0.3.0 release: Benchmark data for key well-lit paths is now available.
Benchmarking and usability: Top priorities include making the pure run command more robust and developing more user-friendly benchmark reports and analysis tools.
PD Disaggregation:
We’re starting to work on items for the next release. Priorities include:
Improved memory footprint for WideEP decoders
Robustness + reliability for P/D
Tooling for debugging the llm-d high performance networking stack
GB200 WideEP
KV Disaggregation:
vLLM-Native CPU Offloading Connector landed in vLLM
Precise prefix-cache aware scheduling enhanced with granular GPU/CPU visibility
Blog release: KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d
Why prefix caching matters, how naive load balancers waste it, and how llm-d’s cache-aware scheduling recovers the wins: 57x faster response times and 2x throughput at scale (a toy illustration of the scheduling idea follows below)
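As a rough mental model only (not the llm-d EPP scorer), cache-aware scheduling can be thought of as scoring each endpoint by how much of the incoming prompt's prefix it already has in its KV cache, discounted by load. The sketch below is a toy illustration under that assumption; the Pod fields, load penalty, and cached-prefix bookkeeping are all hypothetical.

```python
# Toy illustration of prefix-cache-aware endpoint scoring; not the llm-d EPP
# implementation. Pods, their cached prefixes, and the load penalty are hypothetical.
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    cached_prefixes: list[str]  # prompt prefixes believed to be resident in KV cache
    queue_depth: int            # rough load signal

def prefix_hit_len(prompt: str, pod: Pod) -> int:
    """Longest cached prefix (in characters) that matches the incoming prompt."""
    return max((len(p) for p in pod.cached_prefixes if prompt.startswith(p)), default=0)

def pick_pod(prompt: str, pods: list[Pod], load_penalty: float = 10.0) -> Pod:
    """Prefer pods that can reuse the most prefix cache, discounted by their load."""
    return max(pods, key=lambda pod: prefix_hit_len(prompt, pod) - load_penalty * pod.queue_depth)

pods = [
    Pod("pod-a", ["You are a helpful assistant."], queue_depth=2),
    Pod("pod-b", [], queue_depth=0),
]
# Cache reuse outweighs the moderate load difference, so pod-a wins here.
print(pick_pod("You are a helpful assistant. Summarize this report...", pods).name)
```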
Installation:
Testing out a composable install for v0.3, creating resources like httpRoutes out of band
Migrated wide-ep-lws example to using only upstream GIE charts + static deployment manifests
Support for additional clouds: Digital Ocean
Support for additional hardware backends: TPU, XPU (AMD ROCm deferred)
Autoscaling:
Work-in-progress on gathering metrics per variant from EPP to support multiple SLO classes per model.
Work-in-progress on documentation on how to work with multiple inference pools and how llm-d provides recommendations.
Work-in-progress on scaling to and from zero support, with a focus on scaling the instances and not the EPP.
Observability:
Improvements to llm-d/docs/monitoring, including a script to generate metrics and useful queries for building dashboards (a minimal metrics-scrape sketch follows below)
llm-d guides now default to creating Prometheus monitors for vLLM & EPP metrics - Prometheus is now a prerequisite unless explicitly disabled
Work-in-progress with SIG-Observability on a PoC of auto-instrumentation with the OpenTelemetry Operator for distributed tracing, as well as turning on vLLM tracing.
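For readers who want to poke at the raw data behind those Prometheus monitors: vLLM serves Prometheus-format metrics on its /metrics endpoint. The snippet below is a minimal sketch, not the llm-d monitoring script; the URL assumes a vLLM server reachable on localhost:8000, and it simply lists the metric names a scrape returns.

```python
# Minimal sketch: list vLLM metric names from a server's Prometheus /metrics endpoint.
# The URL assumes a vLLM server on localhost:8000; adjust for your deployment.
import urllib.request

def list_metric_names(url: str = "http://localhost:8000/metrics") -> list[str]:
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode()
    # Prometheus exposition format: "# HELP <name> <description>" lines name each metric.
    return sorted({line.split()[2] for line in text.splitlines() if line.startswith("# HELP")})

if __name__ == "__main__":
    for name in list_metric_names():
        if name.startswith("vllm"):
            print(name)
```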
vLLM
Communications: Slack
As of v0.11.0, the V0 vLLM engine has now been fully removed from the codebase.
Focus continues on optimizations for “Wide EP” workloads including Dual Batch Overlap (DBO).
Work was kicked off on support for fully deterministic batched inference, in support of RL use cases and more robust regression testing (see the reproducibility sketch after these bullets).
Progress has been made on CI speed and stability, with ongoing improvements underway.
MoonshotAI released a Checkpoint Engine for efficient RL weight updates with vLLM at trillion-parameter scale.
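Deterministic batched inference means identical outputs for identical requests regardless of how they happen to be batched with other traffic. A naive way to sanity-check run-to-run reproducibility with the offline API is sketched below; the model name is a placeholder, and this only compares two greedy runs, not the stronger batch-invariance property the upstream work targets.

```python
# Naive reproducibility check with vLLM's offline API; a sketch only.
# The model name is a placeholder, and this only compares two greedy runs —
# the upstream determinism work targets the harder batch-invariance property.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=64, seed=1234)

prompts = ["Explain KV cache offloading in one sentence."]
run1 = [o.outputs[0].text for o in llm.generate(prompts, params)]
run2 = [o.outputs[0].text for o in llm.generate(prompts, params)]
print("identical outputs:", run1 == run2)
```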
Notable additions since the last update:
LRU-based KV cache CPU offloading is now supported natively.
FULL_AND_PIECEWISE is now the default mode for CUDA graphs, providing better out-of-the-box support for most models while preserving compatibility for models that support only PIECEWISE.
aarch64 is now supported, allowing use of vLLM on GB200 platforms.
Decode Context Parallel (DCP) is now supported with MLA for more efficient kv cache use.
NVIDIA compute capability < 8.0 (V100, T4, etc.) is now supported again in V1.
Async scheduling: Forward pass input prep is now fully overlapped, providing significant speed-up in low-latency cases. Async scheduling will be on by default in a future release.
Dep updates: Pytorch 2.8, FlashInfer 0.3.1, CUDA 13, ROCm 7.0, Xgrammar 0.1.23.
Models:
New model families and enhancements: DeepSeek-V3.2-Exp, Qwen3-VL series, Qwen3-Next, OLMo3, LongCat-Flash, Dots OCR, Ling2.0, CWM, Apertus, LFM2, MiDashengLM, Motif-1-Tiny, Seed-OSS, Google EmbeddingGemma-300m, GTE sequence classification, Donut OCR model, KeyeVL-1.5-8B, R-4B vision model, Ernie4.5 VL, MiniCPM-V 4.5, Ovis2.5, InternVL3.5 with video support, Qwen2Audio embeddings, NemotronH Nano VLM, Whisper encoder-decoder for V1, RADIO encoder support, and preliminary Transformers backend encoder-only support.
Pipeline parallelism support for more models, data parallel for more vision models.
Added LoRA support to Voxtral, Qwen-2.5-Omni, and DeepSeek models V2/V3/R1-0528, with significantly faster LoRA startup performance.
Task expansion: BERT token classification/NER, multimodal models for pooling, Multi-label classification support, logit bias and sigmoid normalization, and FP32 precision heads for pooling models.
Other Features: Qwen3-VL text-only mode, EVS video token pruning, Mamba2 TP+quantization, MRoPE + YaRN, Whisper on XPU, LongCat-Flash-Chat tool calling, SeedOSS reason parser, EAGLE3 for MiniCPM3 and GPT-OSS.
Engine Core:
Added cross-attention KV cache for encoder-decoder models, request-level Logits Processor support, and KV events from connectors.
Backend expansion: Terratorch integration enabling non-language model tasks like semantic segmentation and geospatial apps with --model-impl terratorch support.
Hybrid and Mamba model improvements: Disabled prefix caching for hybrid/Mamba models, added FP32 SSM kernel support, full CUDA graph support for Mamba1.
Multimodal caching improvements, improved V1 video embedding estimation.
Sampling and structured outputs: Support for all prompt logprobs, final logprobs, grammar bitmask optimization, and user-configurable KV cache memory size.
Hybrid memory allocator now supports pipeline parallel, varying hidden sizes.
Attention: Hybrid SSM/Attention in Triton, FlashAttention 3 for ViT.
Performance: FlashInfer RoPE 2x speedup, fused Q/K RoPE 11% improvement, 8x spec decode overhead reduction, FlashInfer spec decode with 1.14x speedup, model info caching, inputs_embeds copy avoidance.
Weight loading: Various improvements including multi-threaded weight loading, --safetensors-load-strategy for NFS based file loading acceleration.
Large-Scale Serving and Performance:
Dual-Batch Overlap (DBO), DeepEP high throughput + prefill.
Data Parallelism: torchrun launcher, Ray placement groups, Triton DP/EP kernels.
EPLB: Hunyuan V1, Mixtral, static placement, reduced overhead.
Disaggregated serving: KV transfer metrics support, NIXL MLA latent dimension.
MoE: Shared expert overlap optimization, SiLU kernel for DeepSeek-R1, and enabled Allgather/ReduceScatter backend for NaiveAllToAll.
NCCL symmetric memory with 3-4% throughput improvement, enabled by default for TP.
Hardware and Model Performance:
NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS and FlashInfer backends, DeepGEMM Linear with 1.5% E2E throughput improvement, Hopper DeepGEMM E8M0 for DeepSeekV3.1, SM100 FlashInfer CUTLASS MoE FP8 backend, MXFP4 fused CUTLASS MoE, default MXFP4 MoE on Blackwell, and GPT-OSS DP/EP support with 52,003 tokens/s throughput. BF16 fused MoE for Hopper/Blackwell expert parallel.
FlashMLA disabled on Blackwell GPUs due to compatibility issues.
Kernel and attention optimizations: FP8 FlashInfer MLA decode, FlashAttention MLA with CUDA graph support, V1 cross-attention support, FP8 support for FlashMLA, fused grouped TopK for MoE, Flash Linear Attention kernels, and W4A8 support on Hopper.
Performance improvements: 13.7x speedup for token conversion, TTIT/TTFT improvements for disaggregated serving, symmetric memory all-reduce by default, FlashInfer warmup during startup, V1 model execution overlap (with async scheduling), and various Triton configuration tuning.
Platform expansion: Apple Silicon bfloat16 support for M2+, IBM Z V1 engine support, Intel XPU torch.compile, XPU MoE data parallelism, XPU Triton attention, XPU FP8 quantization, and ROCm pipeline parallelism with Ray.
Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100, GLM-4.5-Air-FP8 B200 configs, Kimi K2 optimization (#24597), and QWEN3 Coder/Thinking configs.
DeepGEMM: Enabled by default, 5.5% throughput improvement.
New architectures: RISC-V 64-bit, ARM non-x86 CPU, ARM 4-bit fused MoE.
AMD: ROCm 7.0, GLM-4.5 MI300X tuning. Intel XPU: MoE DP accuracy fix.
Quantization:
Per-layer quantization routing, GGUF quantization with layer skipping, NFP4+FP8 MoE support, W4A8 channel scales, AMD CDNA2/CDNA3 FP4 support.
Compressed tensors transforms for linear operations enabling techniques like SpinQuantR1R2R4 and QuIP methods.
FP8: KV cache for TRTLLM prefill attention and torch.compile, qkv attention kernels, per-tensor GEMMs, per-token-group quantization, hardware-accelerated instructions.
FP4: NVFP4 for dense models, faster W4A8 preprocessing.
ROCm TorchAO quantization enablement and TorchAO module swap configuration.
MXFP4 MoE loading cache optimization and compressed tensors version updates.
Breaking change: Removed original Marlin quantization format.
API / CLI / Front-end:
OpenAI API enhancements: Gemma3n audio transcription/translation endpoints, transcription response usage statistics, and return_token_ids parameter (a minimal client call is sketched after this list).
Response API: Streaming support for non-harmony responses, non-streaming logprobs, MCP streaming+background support, tool output token reporting.
Frontend optimizations: Error stack traces with --log-error-stack, collective RPC endpoint, beam search concurrency optimization, custom media UUIDs.
Formalized --mm-encoder-tp-mode flag, VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable, EPLB configuration parameter, embedding endpoint chat request support.
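Most of these front-end additions surface through vLLM's OpenAI-compatible server. As a reminder of the basic shape of a request, here is a minimal chat completion call; the base URL and model name are placeholders for a local deployment.

```python
# Minimal chat completion against a vLLM OpenAI-compatible server.
# Base URL and model name are placeholders for a local deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give me one sentence on Wide EP."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```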
Hundreds of other fixes and improvements to performance and function.
Articles and blog posts
Serving Geospatial, Vision, and Beyond: Enabling Multimodal Output Processing in vLLM
vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency
DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action
Scaling DeepSeek-style MoEs with vLLM and llm-d using Wide EP
Run Qwen3-Next on vLLM with Red Hat AI: A step-by-step guide
vLLM or llama.cpp: Choosing the right LLM inference engine for your use case
KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d
Llama Stack (Inference)
Communications
Community office hours happen weekly on Thursdays at 12pm EST on Discord
Recent releases (changelog): v0.2.21, v0.2.22, v0.2.23. Notable changes related to inference include:
Important announcement from the maintainers:
The release after v0.2.23 will be 0.3.0 which will be backward incompatible. We will change API URLs, drop some APIs (e.g., batch inference) etc. in favor of the OpenAI suite, and perform other kinds of cleanup. If important bug fixes are asked for and it takes us a while to get 0.3.0 to stable, we will fork off a 0.2.23.x bug fix release from the 0.2.23 branch.
New features:
Added OpenAI Prompts API
Added default inference store automatically during llama stack build
Introduced write queue for inference store
Enhanced Together provider with embedding and dynamic model support
Combined dynamic model discovery with static embedding metadata for better model information
Standardized the Ollama and Fireworks providers with an OpenAI compatibility layer (see the client sketch below)
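Because providers are being standardized behind an OpenAI compatibility layer, the stock OpenAI Python client can be pointed at a Llama Stack deployment. The sketch below is only illustrative: the port, the OpenAI-compatible route, and the model id are assumptions about a local setup, so check your distribution's docs for the exact values.

```python
# Sketch: calling a Llama Stack server through its OpenAI-compatible surface.
# The base URL path and model id are assumptions about a local deployment —
# consult your distribution's docs for the exact OpenAI-compatible route.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
resp = client.chat.completions.create(
    model="ollama/llama3.2:3b",  # placeholder provider/model id
    messages=[{"role": "user", "content": "Hello from the newsletter example."}],
)
print(resp.choices[0].message.content)
```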
Bug fixes:
Fixed broken Fireworks chat completion implementation
Fixed AWS Bedrock inference profile ID conversion for region-specific endpoints
Fixed the inference recorder to handle both Ollama and OpenAI models
Fixed vLLM inference recording (await models.list)
Other notable changes:
Updated several inference providers’ implementations to use openai-python for openai-compat functions
Removed openai dependency from providers

