PyTorch ATX is teaming up with the vLLM community on September 17th for a hands-on look at the next generation of AI inference pipelines. We'll explore the full modern stack: aggressive model-size reductions such as INT4/INT8 quantization and pruning, dynamic batching, paged-attention memory management, and multi-node scheduling. We'll dive into vLLM, today's most popular open-source engine for high-throughput LLM inference, and then learn how to deploy at larger scale using the llm-d project.
Talks and presenters include:
- “Getting started with inference using vLLM” - Steve Watt, PyTorch ambassador
- “An intermediate guide to inference using vLLM - PagedAttention, Quantization, Speculative Decoding, Continuous Batching, and more” - Luka Govedič, vLLM core committer
- “vLLM Semantic Router - Intelligent Auto Reasoning Router for Efficient LLM Inference on Mixture-of-Models” - Huamin Chen, vLLM Semantic Router project creator
- “Combining Kubernetes and vLLM to deliver scalable, distributed inference with llm-d” - Greg Pereira, llm-d maintainer
Expect deeply technical talks, live demos, and open Q&A with the engineers building and running these systems.
When: September 17, 2025, 5:30 PM to 8:30 PM
Where: Voltron Room - Capital Factory (1st Floor) in Austin, TX
Light food and beverages will be provided.