Augmented Mind Podcast
Woosuk Kwon is CTO of Inferact and creator of the vLLM inference library. Woosuk shares what it takes to build the most popular open-source LLM inference engine from a human-centered perspective.

Outline:
0:00 - Prelude: Introducing Woosuk and Inferact
3:00 - Woosuk’s First PhD Project
6:00 - How the vLLM Project Got Started
9:18 - AI Infra Needs More Than Just Efficiency
14:08 - How AI Infra and Human-centered AI Are Connected
15:01 - How to Prioritize Feature Requests for Popular AI Infra
18:18 - Streaming Requests and Realtime API
24:05 - Multi-turn, Agentic, Proactive LLMs
27:03 - How to Design AI Infra in a Principled Way
29:13 - How to Design an AI Inference Engine for Continual Learning with RL
35:05 - Would LoRA Training Affect RL Infra Design?
37:28 - Why Start an AI Inference Infra Startup?
40:46 - What Effortless Inference with Open-source Models Means for Developers
43:46 - A Vision for On-device AI Inference
46:19 - Can Today’s Coding Agents Create vLLM?

References:
Inferact: https://inferact.ai/
Efficient Memory Management for Large Language Model Serving with PagedAttention: https://arxiv.org/abs/2309.06180
Streaming Requests & Realtime API in vLLM: https://vllm.ai/blog/streaming-realtime
RL’s Razor: Why Online Reinforcement Learning Forgets Less: https://arxiv.org/abs/2509.04259

Podcast Links:
Podcast website: https://augmented-mind.github.io/
Apple Podcasts: https://podcasts.apple.com/us/podcast/augmented-mind-podcast/id1868102170
Spotify: https://open.spotify.com/show/40KculkYTe2tOpqJm6TAYr?si=PU_UncsMT4mXjVNCRwoXog&nd=1&dlsi=6d9bed7a43d64085
RSS: https://anchor.fm/s/10dbf5b7c/podcast/rss

About the Hosts:
The AM Podcast is hosted by Yijia Shao, Shannon Shen, and Michael Ryan, CS PhD students at Stanford University and MIT.
5 episodes