Discussion about this post

Prashanth Manohar

Thanks for the detailed update.

The focus on Scale-to-Zero support in WVA is awesome! It's what the industry needs right now.

We’ve been tackling the runtime side of this challenge (achieving <2s loading via memory tiering) to make that autoscaling logic viable for latency-sensitive workloads. We actually just shared our operator architecture with Rob Shaw to see how it might align with these new llm-d patterns.
