State of the Model Serving Communities - September 2025
Most recent updates from several AI/ML model inference communities that our teams at Red Hat AI are contributing to.
Hi everyone,
This newsletter will provide you with recent updates on various model serving communities, keep you informed about Red Hat AI’s contributions to upstream communities, and foster collaboration across teams and organizations.
In case you missed it, last month we launched this newsletter publicly on Substack, and we already have 300+ subscribers! Feel free to share it with others who might be interested in receiving future updates.
Executive Summary
Community Outreach: Red Hat teams have many accepted sessions relevant to model serving at KubeCon North America and its co-located events. In addition, llm-d is featured on the Kubernetes Podcast from Google and on Technically Speaking with Chris Wright. There were also multiple vLLM community events in Asia, as well as office hours.
WG Serving: The WG Serving community meetings are now monthly, giving subprojects, ecosystem projects, and researchers opportunities to present and demo their work. The GIE project released v1.0.0 RC3, which includes InferencePool GA, an experimental Data Layer, enhanced Helm charts, an improved conformance suite, and deprecation of the InferenceModel CRD in favor of a new InferenceObjective CRD. Serving Catalog has added support for additional machine types, and support for AKS is in progress. The Inference Perf project is actively working towards its 0.2.0 release.
KServe: KServe is progressing on multiple fronts, including the CNCF incubation process, which is currently in internal CNCF TOC review with a public comment period to follow. The KServe website has been updated to highlight its GenAI capabilities. Additionally, the team is working on the final PR for llm-d integration, with E2E tests and documentation to be added after it merges.
llm-d: Updates from the various SIGs in the llm-d community:
Inference Scheduler: Progress on e2e tests, a pluggable GIE Data Layer, DP Scheduler support, and releases for llm-d scheduler v0.3.0 and IGW v1.0.
PD Disaggregation: Significant progress has been made in wide expert parallelism, with vLLM's performance nearing SOTA. Engagement with users on P/D disaggregation in llm-d is providing validation and insight into feature requirements for real production environments.
Benchmarking: Enhanced support for "design of experiments" methodology, unified performance report formatting, and improved tooling for performance graph creation.
KV Disaggregation: Establishment of vLLM-Native CPU offloading as a required feature, completion of v0.3.0 release, and preparation for v0.4.0 with multi-modal KV-cache indexing.
Installation: Migration to v0.3 with goals to reduce dependencies, use Helm charts for gateway infrastructure, and enable flexible release names.
Autoscaling: Discussions on graceful scale-down, alignment with vLLM, HPA integration with ScaleToZero, and improvements to Optimizer.
Observability: Ongoing review of PRs for moving observability dashboards and creating a Service Monitor, along with a roadmap for exposing missing metrics.
vLLM: vLLM now has more than 1500 contributors, and project co-creator Zhuohan Li has returned. A focused effort is underway to improve the CI. Significant effort went into the gpt-oss release and continued related improvements. Current major focus areas are performance of WideEP deployments, improved Nvidia Blackwell GPU support, numerical stability, and project UX, including ease of setup and startup speed. There were four vLLM meetups in Asia this month, and a record number of articles and blog posts.
Llama Stack (Inference): The Llama Stack project has moved to a new GitHub organization, llamastack. Recent releases (v0.2.17-v0.2.20) include new inference provider features such as Gemini Flash-Lite 2.0/2.5 model integration, Google Vertex AI support, and a batches API with OpenAI compatibility and inference replay. Bug fixes addressed inference telemetry and a missing module in inference tests. Other changes include removing the SQLite dependency from the inference recorder, standardizing InferenceRouter model handling, adding vision inference test support, and a workflow for re-recording inference outputs.
Community Outreach
Accepted sessions from Red Hat teams at KubeCon North America and co-located events (selected based on relevance to model serving; see a complete list of sessions from Red Hat here):
Navigating the Rapid Evolution of Large Model Inference: Where does Kubernetes Fit?
Tutorial: A Cross-Industry Benchmarking Tutorial for Distributed LLM Inference on Kubernetes
Kubeflow Ecosystem: Navigating the Cloud-Native AI/ML and LLMOps Frontier
Intelligent LLM Routing: A New Paradigm for Multi-Model AI Orchestration in Kubernetes
LLM-D, with Clayton Coleman and Rob Shaw at the Kubernetes Podcast from Google
Inside distributed inference with llm-d from Technically Speaking with Chris Wright
vLLM community events:
International vLLM meetups in Seoul (Aug 19th), Shanghai (Aug 23rd), Singapore (Aug 27th), and Shenzhen (Aug 30th).
gpt-oss meetup with vLLM, Ollama, and OpenAI in San Francisco on Aug 27th.
There are upcoming vLLM meetups in Boston on Sept 18th and Toronto on Sept 25th.
Bi-weekly Office Hours:
WG Serving
Communications: #wg-serving channel in the Kubernetes Slack and the mailing list
Workstream and subproject updates (note: you’ll need to join the WG Serving mailing list to access some of the documents)
Overall organization updates:
The community meeting has moved to a monthly cadence. In future meetings, we want to give subprojects and ecosystem projects opportunities to present their work and demo new features, and give researchers opportunities to share their research. Please reach out to the co-chairs if interested.
Subprojects updates:
Gateway API Inference Extension (GIE):
Release v1.0.0 is being finalized: RC3 is out, and the final release is expected in a few days. The release includes InferencePool GA, a new experimental Data Layer, enhanced Helm charts, improvements to the conformance suite, and deprecation of the InferenceModel CRD in favor of a new InferenceObjective CRD.
Significant progress on the Flow Control work (fairness/SLOs); a workable version is expected in the next release.
A proposal for a Multi-Cluster InferencePool is under discussion.
Discussions on a pluggable Body-Based Routing (BBR) are ongoing.
Traffic splitting and model redirect have been temporarily removed; they will return in future releases.
Serving Catalog:
Support for AKS is in progress
Added support for additional machine types
Inference Perf:
The community is actively working towards the 0.2.0 release
KServe
Communications: #kserve channel in the CNCF Slack
CNCF incubation: We have completed all the required changes and finished the adopter interviews. The KServe incubation application has moved to CNCF TOC internal review, which will be followed by a public comment period.
The KServe website has been updated with a new design that highlights its GenAI capabilities and latest improvements.
llm-d integration status update: we are working on the final PR to provide a functioning llm-d integration. E2E tests and documentation will be added once it is merged.
llm-d
Communications: Slack, mailing list, and community meeting
SIG updates (note: you’ll need to join the llm-d mailing list to access the documents):
Inference Scheduler:
Scheduler e2e tests: Added tests utilizing IGW test infrastructure, covering non-P/D, P/D, and KV scenarios.
Pluggable/Extensible GIE Data Layer: Implementation based on the design merged to IGW, providing a standardized, extensible way to collect, store, and expose endpoint attributes for scheduling.
DP Scheduler Support: Released design doc and started initial implementation (work in progress).
llm-d scheduler v0.3.0 and IGW (InferencePool GA) v1.0 release cuts (to be promoted)
Benchmarking:
Support for a “design of experiments” methodology, where all relevant factors, parameters, and levels are described in a YAML file and the resulting experiments are executed automatically (see the sketch after this list)
Performance data generated by multiple different “harnesses” (i.e., inference-perf, guidellm, vllm-benchmark, and fmperf) is now formatted into a single, unified, standardized report in YAML format
Tooling (in the form of a Jupyter Notebook) is now provided for quick performance graph creation, including comparisons between multiple different sets of parameter values generated by an experiment
Better support for performance evaluation of already existing llm-d stacks (i.e., stacks which were not stood up directly through llm-d benchmark).
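To make the “design of experiments” idea concrete, here is a minimal sketch of expanding a factors/levels specification into the full set of benchmark runs. The factor names and values are hypothetical; the actual llm-d benchmark tooling drives this from a YAML file rather than an inline dictionary.

```python
# Minimal sketch of the "design of experiments" expansion: every combination of
# factor levels becomes one benchmark run. Factor names and values below are
# hypothetical examples, not the llm-d benchmark schema.
from itertools import product

factors = {
    "model": ["llama-3.1-8b", "llama-3.1-70b"],
    "tensor_parallel_size": [1, 2],
    "max_concurrency": [8, 32, 128],
}

# Cartesian product of all levels -> 2 * 2 * 3 = 12 experiments.
runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]
for i, run in enumerate(runs):
    print(f"experiment {i}: {run}")  # each entry would map to one harness execution
```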
PD Disaggregation:
Significant progress was made on wide expert parallelism over the course of the month, and vLLM performance with DeepSeek is approaching SOTA, with improvements coming primarily from a performant implementation of dual batch overlap and async scheduling in vLLM.
We have engaged with users on P/D disagg in llm-d, which provides opportunities to validate performance and understand feature requirements in real production environments.
KV Disaggregation:
vLLM-Native CPU offloading established as a required feature in vLLM
Reviews are near completion, with consensus to merge the current implementation
v0.3.0 release cut (rc1, to be promoted), including:
Production-ready OpenAI Chat-Completions preprocessing library
Synchronous tokenization with caching (rough sketch after this list)
Expanded benchmarking and stronger test coverage
General code and documentation improvements
v0.4.0 draft prepared, including:
Multi-modal KV-cache indexing support
vLLM-Native CPU offloading connector landed and integrated with the precise-prefix-cache scorer
Updated LMCache connector integration via KVEvents
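As a rough illustration of the “synchronous tokenization with caching” item above, the sketch below memoizes tokenization of repeated prompt strings. It is a conceptual example only, not the project’s implementation, and the tokenizer/model name is an assumption.

```python
# Conceptual sketch of synchronous tokenization with caching: identical prompts
# are tokenized once and later served from a memo cache. Not the llm-d/vLLM
# implementation; the model name below is an assumed example.
from functools import lru_cache
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

@lru_cache(maxsize=4096)
def tokenize_cached(prompt: str) -> tuple[int, ...]:
    # Return an immutable tuple so cached entries are safe to share.
    return tuple(tokenizer.encode(prompt))

first = tokenize_cached("What is prefill/decode disaggregation?")
second = tokenize_cached("What is prefill/decode disaggregation?")  # served from cache
assert first == second
```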
Installation:
v0.3 migration landed with the following core goals:
Cut down on dependencies, and move dependency docs alongside the installation script that manages dependencies
Migrate the gateway infrastructure deployment method to Helm charts instead of scripts
Remove doc redundancy
Enable flexible release names, working toward quickstart concurrency (waiting on one PR to land upstream in GIE, expected in time for the v0.3 release)
Autoscaling:
Graceful scale-down was discussed with the community
Aligning to the vLLM version packaged by the llm-d community
HPA integration with ScaleToZero on KinD is now enabled
Fixes to the Optimizer to prioritize based on priority classes.
Alpha feature in the Optimizer to support TTFT (time to first token).
Improved test coverage
Running tests on OpenShift
Observability:
The PR in llm-d to move observability dashboards is under review
The PR to create a Service Monitor in GIE is under review
The roadmap for observability is available, and the SIG is working with component teams to expose missing metrics
vLLM
Communications: Slack
Two releases since the last update: v0.10.1 and v0.10.1.1 (the latter with important security fixes)
Focus areas: continued optimizations for “Wide EP” workloads; Nvidia Blackwell support; usability, including server startup time; new model support; numerical stability; and V0 deprecation.
The project is making a concerted effort to improve the CI, with many stability improvements underway and a near-term goal of 30 minute CI job time for pull requests.
Zhuohan Li, one of the two co-creators of vLLM, has moved from OpenAI to Meta where he will be contributing again to the project full-time. vLLM now has more than 1500 contributors.
Notable additions since the last update:
Full CUDA‑Graph compatibility with FlashAttention2 and FlashInfer; 6% end-to-end throughput improvement from Cutlass MLA.
Pooling models now default to chunked prefill and prefix caching; chunked local attention is now disabled by default for Llama4 for better performance.
Extensibility: V1 custom LogitsProcessors support for custom token sampling behavior (a rough sketch of the idea follows below); new model loader plugin system; custom ops support for FusedMoe.
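For readers unfamiliar with logits processors, the sketch below shows the general idea of customizing token sampling by rewriting logits before the sampler runs. The function signature is illustrative only and is not vLLM’s actual V1 LogitsProcessor interface.

```python
# Illustrative-only sketch of what a custom logits processor does: rewrite the
# raw logits before sampling. This is not vLLM's V1 LogitsProcessor API.
import torch

def make_ban_tokens_processor(banned_token_ids: list[int]):
    """Return a processor that prevents the given token ids from being sampled."""
    banned = torch.tensor(banned_token_ids, dtype=torch.long)

    def process(logits: torch.Tensor) -> torch.Tensor:
        # logits shape: [batch_size, vocab_size]. Setting banned positions to
        # -inf gives them zero probability after softmax.
        logits[:, banned] = float("-inf")
        return logits

    return process

processor = make_ban_tokens_processor([42, 1337])  # hypothetical token ids
logits = torch.randn(2, 32000)                     # fake batch of logits
assert torch.isinf(processor(logits)[:, 42]).all()
```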
Models
GPT-OSS with comprehensive tool-calling support, Command-A-Vision, mBART, SmolLM3 via transformers backend, Nemotron-H
VLMs: Eagle support for Llama 4 multimodal, Step3 VLM, Gemma3n multimodal, MiniCPM-V 4.0, HyperCLOVAX-SEED-Vision-Instruct-3B, Emu3 with Transformers backend, Intern-S1, Prithvi in online serving mode.
GLM-4.5 series improvements, Ultravox support for Llama4 and Gemma 3; V1 support for Mamba1 and Jamba.
Encoder-only models without KV-cache enabling BERT-style architectures; extended support for additional pooling models.
Expanded tensor parallelism support in Transformers backend and Deepseek_vl2.
Hardware and Performance
NVIDIA Blackwell optimizations: CutlassMLA as default backend, FlashInfer MoE per-tensor scale FP8 backend, SM90 CUTLASS FP8 GEMM with kernel tuning and swap AB support; NVIDIA RTX 5090/RTX PRO 6000: Block FP8 quantization and CUTLASS NVFP4 4-bit weights/activations.
AMD: Flash Attention backend for Qwen-VL models, AITER HIP block quantization kernels and optimized kernel performance for small batch sizes 1-4.
ARM CPU build fixes for systems without BF16 support, Machete memory-bound performance improvements, FlashInfer TRT-LLM prefill attention kernel support, optimized reshape_and_cache_flash CUDA kernel, CPU transfer support in NixlConnector.
Attention: Tree attention for v1 (experimental), FlexAttention encoder-only support, updated FA3 with attention sink support, multiple attention groups for KV sharing patterns. Triton-based multi-dimensional RoPE replacing PyTorch implementation, async tensor parallelism for scaled matrix multiplication.
Mamba2 reduced device-to-device copy overhead, fused RMSNorm Triton kernels.
Balanced expert sharding for MoE models, expanded fused-kernel support for topk softmax, fused MoE for nomic-embed-text-v2-moe.
Improved multimodal hasher performance for repeated image prompts, multithreaded async multimodal loading; Structured output throughput improved.
Specialized kernels: GPT-OSS activation functions, RLHF weight loading.
Spec decoding optimizations: N-gram spec decoding with single KMP token proposal algorithm; explicit EAGLE3 interface for enhanced compatibility.
Improved startup time by disabling C++ compilation of symbolic shapes, enhanced headless models for pooling in Transformers backend.
Quantization
Advanced techniques: MXFP4 and bias support for Marlin kernel, NVFP4 GEMM and MoE FlashInfer backends, compressed-tensors mixed-precision model loading.
Dynamic 4-bit quantization with Kleidiai kernels for CPU, TensorRT-LLM FP4 optimized for MoE low-latency.
BitsAndBytes quantization for InternS1 and additional MoE models, Gemma3n compatibility, calibration-free RTN quantization for MoE models, ModelOpt Qwen3 NVFP4 support.
CUDA kernel optimization for Int8 per-token group quantization, non-contiguous tensor support in FP8, automatic detection of ModelOpt format.
Breaking: Removed AQLM quantization support.
API / CLI / Front-end
Unix domain socket support, improved compatibility with the OpenAI API spec (see the usage sketch after this list).
New dedicated LLM.reward interface for reward models, chunked processing for long inputs in embedding models, V1 API support for run-batch command.
Support for multiple API keys, environment variable control for logging statistics.
Custom process naming for better monitoring, improved help display showing available choices, enhanced logging of non-default arguments.
HermesToolParser for models without special tokens, multi-turn conversation benchmarking tool.
Support for “hybrid” DP LB mode, request_id support for external load balancers.
Per-request pooling control via PoolingParams.
Hundreds of other fixes and improvements to performance and functionality.
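Related to the OpenAI API compatibility work above, here is a minimal sketch of querying a locally running vLLM OpenAI-compatible server with the standard openai client. The base URL, API key, and model name are assumptions for illustration.

```python
# Minimal sketch: talking to a vLLM OpenAI-compatible server, e.g. one started
# with `vllm serve <model>`. The URL, key, and model name are assumed examples.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # default local vLLM endpoint (assumed)
    api_key="EMPTY",                      # vLLM accepts any key unless one is configured
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",     # whatever model the server was started with
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```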
Articles and blog posts
Inside vLLM: Anatomy of a High-Throughput LLM Inference System
Batch inference on OpenShift AI with Ray Data, vLLM, and CodeFlare
How we leveraged vLLM to power our GenAI applications at LinkedIn
How Amazon scaled Rufus by building multi-node inference using AWS Trainium and vLLM
Integrate vLLM inference on macOS/iOS with Alamofire and Apple Foundation
Llama Stack (Inference)
Communications: Community office hours happen weekly on Thursdays at 12pm EST on Discord
Llama Stack has moved to a new GitHub organization, llamastack, from the original Meta-owned meta-llama organization.
Recent releases (changelog): v0.2.20, v0.2.19, v0.2.18, v0.2.17. Notable changes related to inference include:
New features:
Added Flash-Lite 2.0 and 2.5 models to Gemini inference provider
Added Google Vertex AI inference provider support
Added batches API with OpenAI compatibility and inference replay (see the usage sketch after this list)
Added inference record/replay to increase test reliability
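Because the new batches API follows the OpenAI spec, a hedged sketch of how a client might exercise it against a Llama Stack server is shown below. The server URL, model id, and file contents are assumptions; consult the Llama Stack docs for exact usage.

```python
# Hedged sketch of using an OpenAI-compatible batches API against a Llama Stack
# server. Base URL, model id, and request contents are assumed examples.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")  # assumed local server

# Each line of the input file is one request in the OpenAI batch format.
with open("batch_input.jsonl", "w") as f:
    f.write(json.dumps({
        "custom_id": "req-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "llama3.2-3b-instruct",  # hypothetical model id
            "messages": [{"role": "user", "content": "Hello!"}],
        },
    }) + "\n")

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```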
Bug fixes:
Fixed issues with telemetry for inference and core telemetry
Fixed missing module in inference tests
Other notable changes:
Removed SQLite dependency from inference recorder
Standardized InferenceRouter model handling
Added support for running vision inference tests
Added workflow for re-recording inference outputs
Contributors: Robert Shaw, Nick Hill, Tamar Eilam, Nir Rozenbaum, Nili Guy, Maroon Ayoub, Roy Nissim, Greg Pereira, Jooho Lee, Macio Silva, Yuan Tang