State of the Model Serving Communities - August 2025
Most recent updates from several AI/ML model inference communities that our team at Red Hat AI is contributing to.
Hi everyone,
We are excited to officially launch this public newsletter! It will provide you with recent updates on various model serving communities, keep you informed about Red Hat AI’s contributions to upstream communities, and foster collaboration across teams and organizations.
Executive Summary
WG Serving: The WG update proposal for KubeCon, “Navigating the Rapid Evolution of Large Model Inference: Where does Kubernetes Fit?”, was accepted. GIE released v0.5.1 with conformance tests, a new Config API, and updated Helm charts. PRs for Flow Control (Fairness/SLOs) and the new pluggable Data Layer have been merged. Serving Catalog added a compute-class selector for H100 GPU support and an ephemeral-storage local SSD nodeSelector. Inference Perf released v0.1.1 with bug fixes, scalability/observability improvements, and S3 support for report storage.
KServe: The llm-d integration in KServe has seen several updates, including the implementation of reconciliation logic for single-node deployments and support for reconciling managed HTTPRoute resources. Logic for merging LLMInferenceServiceConfig for unified configuration management has been completed, and base configurations for the llm-d inference service have been added. Support for various storage backends (S3, PVC, OCI) has been enabled for single-node LLMInferenceService scenarios. Production readiness has been addressed with the deployment of the GIE EPP and its associated InferencePool. Model discoverability has been implemented within LLMInferenceService to enhance automation and observability, and end-to-end encryption has been implemented for all key llm-d traffic pathways.
llm-d: v0.2.0 has been released. Two talks were accepted for KubeCon and PyTorch Conference, v0.2.1 of inference-scheduler and v0.3.0 of inference-sim were released, and progress continues on a pluggable/extensible GIE Data Layer. A new well-lit path for precise prefix-cache-aware routing (with no LMCache/Redis dependency) has been established, with a v0.2.0 release introducing vLLM-Native KV-Events processing, new indexing backends (including in-memory), enhanced observability with Prometheus metrics, and initial support for OpenAI chat-completions templating. Progress is being made on integrating autoscaling capabilities based on the inferno project, with a demonstration planned soon. WideEP has a working branch in vLLM. A new SIG on Observability has been kicked off, focusing on refining a north-star document, restoring Prometheus monitors for vLLM and EPP pod metrics in the Helm charts, and updating Grafana dashboards and PromQL queries. A proposal is underway to add distributed tracing, starting with OpenTelemetry dependencies and instrumentation to allow individual component owners to add spans and attributes.
vLLM: With the release of v0.10.0, “V0” is officially deprecated. Major focus areas continue to be cluster-scale serving of MoE models, solid support for Nvidia Blackwell GPUs (including FP4), and project UX, including ease of setup and startup speed. Significant additions in the 0.9.2 and 0.10.0 releases include expanded FlexAttention support, more comprehensive full CUDA-graph support, experimental async scheduling, expert-parallel load balancing, and new deployment options for data-parallel load balancing.
Llama Stack (Inference): The AI Alliance officially supports Llama Stack as a foundational AI application framework. The project released v0.2.16 and v0.2.15. Key updates include automatic model registration for self-hosted Ollama and vLLM providers, a simplified starter distribution, removal of the inline vLLM provider, and significant improvements to OpenAI inference implementations and the LiteLLM backend. Dynamic model registration for OpenAI and Llama OpenAI-compatible remote inference providers has been created, alongside dynamic model detection support for inference providers using LiteLLM and infrastructure for inference model discovery.
WG Serving
Communications: #wg-serving channel in Kubernetes Slack and mailing list
Subproject updates (note: you’ll need to join the WG Serving mailing list to access some of the documents)
Overall organization updates:
The KubeCon proposal, titled “Navigating the Rapid Evolution of Large Model Inference: Where does Kubernetes Fit?”, was accepted, with Yuan Tang and co-speakers from Google, Microsoft, and ByteDance.
Subproject updates:
Gateway API Inference Extension (GIE):
Release v0.5.1 is out, including the following key features:
Conformance Tests: Validate the controller’s behavior with e2e tests covering InferencePool, InferenceModel, HTTPRoute, etc.
New Config API: Allows configuring plugins through a config file without touching core code.
Helm Charts: Updated to make reuse of the Config API simple and straightforward.
Working towards GA of InferencePool (expected in Aug-Sep).
Continuing to shape the successor to InferenceModel.
Merged initial PRs for Flow Control (Fairness/SLOs) and the new pluggable Data Layer.
Serving Catalog:
Added a compute-class selector for H100 GPUs to support larger vLLM models.
Added an ephemeral-storage local SSD nodeSelector to the gcsfuse component.
Inference Perf:
Release v0.1.1 is out, with many bug fixes, scalability improvements, observability improvements, and S3 support for report storage. The Python package is also available on PyPI for the first time.
The Inference Perf team gave a presentation and demo to the community.
KServe
Communications: #kserve channel in the CNCF Slack
llm-d integration status update:
Implemented reconciliation logic for single-node Deployment.
Added support to reconcile managed HTTPRoute resources.
Completed logic to merge LLMInferenceServiceConfig for unified configuration management.
Added base configurations for llm-d inference service.
Introduced single-node inference service base configurations for streamlined deployment.
Enabled support for various storage backends (S3, PVC, OCI, etc.) in LLMInferenceService for single-node scenarios.
Deployed GIE EPP and its associated InferencePool for production readiness.
Implemented model discoverability within LLMInferenceService to enhance automation and observability.
Implemented end-to-end encryption for all key llm-d traffic pathways, including connections between the Gateway, Inference Scheduler, and vLLM pods.
llm-d
Communications: Slack, mailing list, and community meeting
SIG updates (note: you’ll need to join the llm-d mailing list to access the documents):
llm-d Scheduler talks accepted for KubeCon and PyTorch Conference:
Serving PyTorch LLMs at Scale: Disaggregated Inference with Kubernetes and llm-d (Nili Guy & Jon Li)
Routing Stateful AI Workloads in Kubernetes (Maroon Ayoub & Michey Mehta)
Released v0.2.1 of inference-scheduler (following the GIE v0.5.1 release)
Released v0.3.0 of inference-sim
Published the Pluggable/Extensible GIE Data Layer design; implementation is in progress
Working on an initial implementation of a “parameter sweep space”, with factors including the number of prefill (p) and decode (d) pods, tensor parallelism (set independently for p and d), and maximum concurrency, evaluated with a 1K input/1K output random-data vllm-benchmark run.
WideEP has a working branch in vLLM, with fixes for internal load-balancing issues across DP ranks on the vLLM main branch.
A proposal for Data Parallelism (DP) support in llm-d-inference-scheduler is open for feedback.
Established a new well-lit path: precise prefix-cache aware routing
No dependency on LMCache/Redis
Introduced vLLM-Native KV-Events processing and new indexing backends
In-Memory index (default): KV-Events are digested and stored in memory
Enhanced observability with live Prometheus metrics for KV-Cache tracking (an illustrative sketch follows below)
Initial support for OpenAI chat-completions templating (library)
v0.2.0 released, with a focus on composability of the stack
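To illustrate the kind of Prometheus instrumentation referenced above for KV-Cache tracking, here is a minimal sketch using the standard prometheus_client library. The metric and label names are hypothetical examples, not the metrics actually exported by llm-d components.

```python
# Minimal sketch: exposing hypothetical KV-cache tracking metrics with prometheus_client.
# Metric and label names are illustrative only, not llm-d's actual metric schema.
from prometheus_client import Counter, Gauge, start_http_server

kv_blocks_tracked = Gauge(
    "kv_cache_blocks_tracked", "KV blocks currently tracked in the index", ["pod"]
)
kv_events_processed = Counter(
    "kv_cache_events_processed_total", "KV-Events processed from vLLM", ["event_type"]
)

def on_kv_event(pod: str, event_type: str, delta_blocks: int) -> None:
    """Update metrics when a KV-Event (e.g. a block stored or evicted) is digested."""
    kv_events_processed.labels(event_type=event_type).inc()
    kv_blocks_tracked.labels(pod=pod).inc(delta_blocks)

if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics for Prometheus to scrape
    on_kv_event("vllm-decode-0", "block_stored", 8)
```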
We are making progress towards integrating autoscaling capabilities based on the inferno project in llm-d and are aiming to demonstrate it next week.
There’s a new SIG! We kicked off SIG-Observability.
Refining the north-star document
Working to restore the Prometheus monitors for scraping vLLM and EPP pod metrics with the new llm-d-modelservice Helm charts
PodMonitors for prefill & decode pods (vLLM) have been added for the next release of the charts; a monitor for EPP is coming soon.
Updating Grafana dashboards and PromQL queries based on actionable insights gathered from community developers; the updated dashboards will be given a permanent home.
Proposal to add distributed tracing, with an initial phase adding minimal OpenTelemetry dependencies and instrumentation to pave the way for individual component owners to add the spans and attributes they need (see the sketch below).
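To illustrate the proposed pattern, here is a minimal sketch using the OpenTelemetry Python API; the span and attribute names are hypothetical, and individual llm-d components may implement the equivalent in their own languages and with their own exporters.

```python
# Minimal sketch: component-level tracing with the OpenTelemetry Python API.
# Span and attribute names below are hypothetical examples, not llm-d's actual schema.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Basic tracer setup; a real deployment would export to an OTLP collector instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-d.example-component")

def schedule_request(request_id: str, target_pod: str) -> None:
    # Component owners add spans and attributes around their own critical sections.
    with tracer.start_as_current_span("scheduler.pick_endpoint") as span:
        span.set_attribute("request.id", request_id)
        span.set_attribute("endpoint.pod", target_pod)
        # ... actual scheduling logic would run here ...

schedule_request("req-123", "vllm-decode-0")
```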
vLLM
Communications: Slack
v0.9.2 was the final release supporting vLLM “V0”; related code is being removed from v0.10 onwards. Notable remaining gaps are encoder-decoder model support and custom LogitsProcessors, both of which are in progress and expected very soon.
Focus areas: continued optimizations for large-scale distributed serving of MoE models; Nvidia Blackwell support; usability, with server startup time in particular; and V0 deprecation.
Notable additions since the last update:
Full CUDA-graph execution is now supported for FlashAttention v3 (FA3) and FlashMLA, including prefix caching. A new CUDA-graph capture progress bar has been added.
Priority Scheduling is now implemented for V1.
FlexAttention improvements: now supports any head size, with FP32 fallback.
Large-Scale Distributed Serving
Expert‑Parallel Load Balancer (EPLB) has been added!
Various prefill/decode disaggregation fixes and hardening.
Native xPyD P2P NCCL-based P/D connector.
New Data Parallel deployment options for external or hybrid load-balancing.
Elastic Expert Parallel for dynamic GPU scaling (Ray-only so far).
Models
V1 support for embedding models, Mamba2, attention-free models, and SSM/attention hybrids.
New families: Gemma‑3 (text‑only), Tarsier 2, Qwen 3 Embedding & Reranker, Llama 4 with EAGLE support, EXAONE 4.0, Nemotron-Nano-VL-8B-V1, GLM‑4.1 V, Keye‑VL‑8B‑Preview, MiniMax‑M1, Phi-4-mini-flash-reasoning, Voxtral, and more…
IBM Granite hybrid MoE configurations with shared experts are fully supported.
VLM improvements: VLM support with the transformers backend, and PrithviMAE on V1.
Core Engine Improvements
Experimental async scheduling to overlap engine core scheduling with GPU runner.
V1 engine improvements: backend-agnostic local attention, MLA FlashInfer ragged prefill, hybrid KV cache with local chunked attention.
Multi-task support for models, including multiple poolers.
RLHF Support: new RPC methods for runtime weight reloading and config updates; logprobs mode for selecting which stage of logprobs to return.
Significant startup-time reduction from faster CUDA-graph capture using frozen GC.
Hardware
Nvidia Blackwell: SM100: block‑scaled‑group GEMM, INT8/FP8 vectorization, DeepGEMM kernels, activation‑chunking for MoE, and group‑size 64 for Machete. FlashInfer MoE blockscale FP8 backend, CUDNN prefill API for MLA, Triton Fused MoE kernel config for FP8 E=16. SM120 CUTLASS W8A8/FP8 kernels.
AMD ROCm: full-graph capture and split-KV for TritonAttention, quick All-Reduce, and chunked prefill.
TPU: dynamic‑grid KV‑cache updates, head‑dim less than 128, tuned paged‑attention kernels, and KV‑padding fixes.
Others: Intel GPU backend with Flash‑Attention support; ARM CPU int8 quantization; PPC64LE/ARM V1 support; Intel XPU ray distributed execution; shared-memory pipeline parallel for CPU; FlashInfer ARM CUDA support.
Quantization
Calibration‑free RTN INT4/INT8 pipeline for effortless, accurate compression.
Compressed-Tensors NVFP4 (including MoE) plus emulation; FP4 emulation removed on devices below SM100.
MoE: MXFP4 support; dynamic MoE‑layer quant (Marlin/GPTQ) and INT8 vectorization primitives; in-flight MoE quantization; broader BNB support.
Bits-and-Bytes 0.45+ with improved double-quant logic and AWQ quality.
Hardware-specific: FP8 KV cache quantization on TPU, FP8 support for BatchedTritonExperts.
Many other performance optimizations.
API / CLI / Front-end
New OpenAI-compatible endpoints: /v1/audio/translations and a revamped /v1/audio/transcriptions, plus an OpenAI Responses API implementation (see the sketch after this list).
Image‑object support in llm.chat, tool‑choice expansion, custom‑arg passthrough, tool-calling with required choice and $defs.
Various model-loading and CLI quality-of-life improvements.
Hundreds of other fixes and improvements to performance and function.
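As a quick illustration of the new audio endpoints, here is a minimal sketch that calls /v1/audio/transcriptions on a vLLM OpenAI-compatible server with the official openai Python client; the server URL, model name, and audio file are assumptions for the example, not prescribed values.

```python
# Minimal sketch: calling vLLM's OpenAI-compatible /v1/audio/transcriptions endpoint.
# The base_url, model name, and audio file are assumptions; point them at your own
# deployment (e.g. a server started with `vllm serve <asr-model>`).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible server (assumed address)
    api_key="EMPTY",                      # vLLM does not require a real key by default
)

with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # an ASR model served by vLLM (assumed)
        file=audio_file,
    )

print(transcription.text)
```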
Articles and blog posts
Videos, podcasts, and Office Hours:
Llama Stack (Inference)
Communications
Community office hours are held weekly on Thursdays at 12pm EST on Discord
Recent releases (changelog): v0.2.16 and v0.2.15. Notable changes related to inference include:
Automatic model registration for self-hosted Ollama and vLLM providers.
A much-simplified starter distribution with auto-enabled providers.
Removed the inline vLLM provider.
Several improvements to OpenAI inference implementations and the LiteLLM backend.
Created dynamic model registration for OpenAI and Llama OpenAI-compatible remote inference providers (illustrated in the sketch after this list).
Implemented dynamic model detection support for inference providers using LiteLLM.
Added infrastructure to allow inference model discovery.
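As an illustration of the dynamic registration and discovery work noted above, here is a minimal sketch using the llama-stack-client Python SDK; the server URL, provider id, and model identifiers are assumptions for the example.

```python
# Minimal sketch: listing and registering models on a running Llama Stack server
# with the llama-stack-client SDK. The base_url, provider_id, and model ids are
# assumptions for illustration; adjust them to your own deployment.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # assumed local server address

# Models exposed by configured providers (e.g. a remote vLLM or Ollama provider)
# show up via the models API.
for model in client.models.list():
    print(model.identifier)

# Dynamically register an additional model backed by an OpenAI-compatible provider.
client.models.register(
    model_id="my-team/custom-llm",    # hypothetical identifier
    provider_id="vllm-inference",     # hypothetical provider id from the run config
    provider_model_id="custom-llm",   # the name the provider knows the model by
)
```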
The AI Alliance officially supports Llama Stack as a foundational AI application framework designed to empower developers, enterprises, and partners in building and deploying AI applications with ease and confidence (announcement).