AI Post Transformers

Podcast by mcgrof

English

Technology and science


About AI Post Transformers

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

All episodes

658 episodes

LPU Chip for Low-Latency LLM Inference

This episode explores a 2024 paper on the LPU, a custom processor designed specifically for large language model inference, with an emphasis on reducing the per-token delay that users notice in interactive systems. It explains why autoregressive decoding is often limited by memory movement and synchronization rather than raw compute, making conventional GPU strengths less decisive in small-batch, user-facing generation. The discussion highlights the paper’s full-stack argument: a specialized chip, a supporting software stack called HyperDex, and a multi-device link meant to preserve low latency while scaling across processors. Listeners would find it interesting because it reframes AI hardware performance around real conversational responsiveness and digs into whether the paper’s bold efficiency and scaling claims actually hold up under careful comparison.

Sources:
1. LPU Chip for Low-Latency LLM Inference https://arxiv.org/pdf/2408.07326
2. DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation - Seongmin Hong et al., 2022 https://scholar.google.com/scholar?q=DFX:+A+Low-latency+Multi-FPGA+Appliance+for+Accelerating+Transformer-based+Text+Generation
3. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning - Hanrui Wang, Zhekai Zhang, Song Han, 2021 https://scholar.google.com/scholar?q=SpAtten:+Efficient+Sparse+Attention+Architecture+with+Cascade+Token+and+Head+Pruning
4. A Software-Defined Tensor Streaming Multiprocessor for Large-Scale Machine Learning - Dennis Abts et al., 2022 https://scholar.google.com/scholar?q=A+Software-Defined+Tensor+Streaming+Multiprocessor+for+Large-Scale+Machine+Learning
5. Efficient Memory Management for Large Language Model Serving with PagedAttention - Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
6. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving - Yinmin Zhong, Shengyu Zhao, et al., 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
7. RazorAttention: Efficient KV Cache Compression Through Retrieval Heads - Hanlin Tang et al., 2024 https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+Through+Retrieval+Heads
8. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning - Yu Fu et al., 2024 https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
9. Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction - Haoran Qiu et al., 2024 https://scholar.google.com/scholar?q=Efficient+Interactive+LLM+Serving+with+Proxy+Model-based+Sequence+Length+Prediction
10. Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving - Ke Cheng et al., 2024 https://scholar.google.com/scholar?q=Slice-Level+Scheduling+for+High+Throughput+and+Load+Balanced+LLM+Serving
11. Deferred Continuous Batching in Resource-Efficient Large Language Model Serving - Yongjun He, Yao Lu, Gustavo Alonso, 2024 https://scholar.google.com/scholar?q=Deferred+Continuous+Batching+in+Resource-Efficient+Large+Language+Model+Serving
12. TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference - Raja Gond, Nipun Kwatra, Ramachandran Ramjee, 2025 https://scholar.google.com/scholar?q=TokenWeave:+Efficient+Compute-Communication+Overlap+for+Distributed+LLM+Inference
13. Characterizing Communication Patterns in Distributed Large Language Model Inference - Lang Xu et al., 2025 https://scholar.google.com/scholar?q=Characterizing+Communication+Patterns+in+Distributed+Large+Language+Model+Inference
14. Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference - Pol G. Recasens et al., 2025 https://scholar.google.com/scholar?q=Mind+the+Memory+Gap:+Unveiling+GPU+Bottlenecks+in+Large-Batch+LLM+Inference
15. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
16. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
17. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3
18. AI Post Transformers: Speculative Decoding in Real vLLM Serving - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
19. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
20. AI Post Transformers: LAPS for Length-Aware LLM Serving - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3
21. AI Post Transformers: Splitwise: Phase-Split LLM Inference - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
22. AI Post Transformers: JANUS for Scalable MoE Inference - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-janus-for-scalable-moe-inference-78ae30.mp3

Interactive Visualization: LPU Chip for Low-Latency LLM Inference [https://podcast.do-not-panic.com/viz/2026-05-20-lpu-chip-for-low-latency-llm-inference-be13c3.html]

May 23, 2026 - 1 h 0 min

After Titans: Behrouz on Nested Learning and Hope

A follow-up to our Titans episode, on a paper by the same core team a year later. Behrouz, Razaviyayn, Zhong, and Mirrokni (Google Research) generalize the Titans bet, that long-term memory should be a learnable module updated at test time, into a broader paradigm they call Nested Learning, where a "deep" architecture is really a hierarchy of nested optimization problems, each compressing its own context flow. The episode walks through their three core contributions: (1) reframing standard optimizers like Adam and SGD-with-Momentum as associative-memory modules that compress gradient information, then proposing more expressive optimizers with their own deep memory; (2) a self-modifying sequence model whose update rule is itself learned end-to-end, the natural generalization of the test-time-learnable memory module Titans introduced; (3) a continuum memory system that replaces the traditional short-term-vs-long-term dichotomy with a continuum across multiple update rates. Combining the self-modifying module with the continuum memory system produces Hope, a continual-learning architecture reported as promising on language modeling, knowledge incorporation, few-shot generalization, continual learning, and long-context reasoning. The hosts treat Hope as the next concrete instance of the Titans → Nested Learning research arc, stress-test the novelty against established meta-learning and fast-weights literature, and distinguish what the paper actually shows from what its framing suggests.

Sources:
  • Nested Learning: The Illusion of Deep Learning Architectures - Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni (Google Research, NeurIPS 2025), arXiv 2512.24695 https://arxiv.org/pdf/2512.24695
  • Titans: Learning to Memorize at Test Time - Ali Behrouz, Peilin Zhong, Vahab Mirrokni (Jan 2025), arXiv 2501.00663 https://arxiv.org/pdf/2501.00663
  • AI Post Transformers, "Titans: Learning to Memorize at Test Time" https://podcast.do-not-panic.com/episodes/2026-05-20-titans-learning-to-memorize-at-test-time-054662.mp3

Interactive Visualization: After Titans: Behrouz on Nested Learning and Hope [https://podcast.do-not-panic.com/viz/2026-05-20-nested-learning-beyond-deep-architecture-7cc949.html]

Yesterday - 1 h 0 min

Titans: Learning to Memorize at Test Time

This episode explores the Titans paper’s proposal to pair standard attention with a separate learned long-term memory that updates during inference, aiming to preserve distant information without paying full quadratic attention costs across very long sequences. It situates that idea against earlier approaches such as Neural Turing Machines, Transformer-XL, Compressive Transformers, Memorizing Transformers, and linear-attention recurrent models, highlighting the recurring tradeoff between precise recall and scalable memory. The discussion focuses on the paper’s most distinctive claim: memory writes are driven by a loss-based notion of surprise, making test-time memory updates look more like small online learning steps than a simple cache. Listeners would find it interesting because it gets at a central open question in modern AI systems design: whether neural networks can gain durable, useful memory at inference time without becoming too unstable, expensive, or operationally awkward to deploy.

Sources:
1. Titans: Learning to Memorize at Test Time https://arxiv.org/pdf/2501.00663
2. Neural Turing Machines - Alex Graves, Greg Wayne, Ivo Danihelka, 2014 https://arxiv.org/abs/1410.5401
3. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context - Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, 2019 https://arxiv.org/abs/1901.02860
4. Compressive Transformers for Long-Range Sequence Modelling - Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, Timothy P. Lillicrap, 2020 https://openreview.net/forum?id=SylKikSYDH
5. Memorizing Transformers - Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy, 2022 https://arxiv.org/abs/2203.08913
6. Learning to (learn at test time): RNNs with expressive hidden states - Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, and Sanmi Koyejo, 2024 https://scholar.google.com/scholar?q=Learning+to+(learn+at+test+time):+RNNs+with+expressive+hidden+states
7. Gated Delta Networks: Improving Mamba2 with Delta Rule - Songlin Yang, Jan Kautz, and Ali Hatamizadeh, 2024 https://scholar.google.com/scholar?q=Gated+Delta+Networks:+Improving+Mamba2+with+Delta+Rule
8. RULER: What's the Real Context Size of Your Long-Context Language Models? - Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg, 2024 https://scholar.google.com/scholar?q=RULER:+What's+the+Real+Context+Size+of+Your+Long-Context+Language+Models?
9. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack - Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev, 2024 https://scholar.google.com/scholar?q=BABILong:+Testing+the+Limits+of+LLMs+with+Long+Context+Reasoning-in-a-Haystack
10. ATLAS: Learning to Optimally Memorize the Context at Test Time - Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=ATLAS:+Learning+to+Optimally+Memorize+the+Context+at+Test+Time
11. KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference - Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez, 2026 https://scholar.google.com/scholar?q=KV-Fold:+One-Step+KV-Cache+Recurrence+for+Long-Context+Inference
12. SCBench: A KV Cache-Centric Analysis of Long-Context Methods - Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu, 2024/2025 https://scholar.google.com/scholar?q=SCBench:+A+KV+Cache-Centric+Analysis+of+Long-Context+Methods
13. Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling - Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen, 2024 https://scholar.google.com/scholar?q=Samba:+Simple+Hybrid+State+Space+Models+for+Efficient+Unlimited+Context+Language+Modeling
14. Longhorn: State Space Models are Amortized Online Learners - Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, Qiang Liu, 2024 https://scholar.google.com/scholar?q=Longhorn:+State+Space+Models+are+Amortized+Online+Learners
15. Retrieval meets Long Context Large Language Models - Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, Bryan Catanzaro, 2023 https://scholar.google.com/scholar?q=Retrieval+meets+Long+Context+Large+Language+Models
16. Augmenting Language Models with Long-Term Memory - Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei, 2023 https://scholar.google.com/scholar?q=Augmenting+Language+Models+with+Long-Term+Memory
17. Test-Time Training Provably Improves Transformers as In-Context Learners - Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Mahdi Soltanolkotabi, Marco Mondelli, Samet Oymak, 2025 https://scholar.google.com/scholar?q=Test-Time+Training+Provably+Improves+Transformers+as+In-Context+Learners
18. AI Post Transformers: δ-mem and Online Memory for LLMs - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-13-d-mem-and-online-memory-for-llms-6622fa.mp3
19. AI Post Transformers: In-Place Test-Time Training for Transformers - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3
20. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3
21. AI Post Transformers: MELT: Decoupling Compute From Memory - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-13-melt-decoupling-compute-from-memory-26430c.mp3
22. AI Post Transformers: Long Context Pre-Training with Lighthouse Attention - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-13-long-context-pre-training-with-lighthous-e85bbe.mp3
23. AI Post Transformers: Training Million-Token LLMs Beyond the Memory Barrier - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-04-training-million-token-llms-beyond-the-m-324edc.mp3
24. AI Post Transformers: Recursive Language Models for Arbitrarily Long Prompts - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-recursive-language-models-for-arbitraril-fbcd1c.mp3
25. AI Post Transformers: How Induction Heads Emerge in Transformers - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-03-how-induction-heads-emerge-in-transforme-a7bfcb.mp3
26. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3

Interactive Visualization: Titans: Learning to Memorize at Test Time [https://podcast.do-not-panic.com/viz/2026-05-20-titans-learning-to-memorize-at-test-time-054662.html]

May 20, 2026 - 1 h 0 min

The Sparsity Wall: What Reiner Pope Told Dwarkesh About MoE and Sparse Attention

A two-paper deep dive framed around the Dwarkesh Patel x Reiner Pope blackboard lecture on training and serving frontier LLMs. The hosts work through "Unified Scaling Laws for Routed Language Models" (Clark et al., DeepMind 2022, arXiv 2202.01169) for the mixture-of-experts side and the DeepSeek sparse-attention paper (arXiv 2512.02556) for the attention side, treating Pope's blackboard framing on the podcast as the pedagogical lens. The episode separates what the papers establish from what Pope's practitioner intuition adds on top, with particular attention to how MoE on the FFN side and sparse attention on the QK side attack independent cost pools and can compound rather than compete.

Sources:
  • arXiv 2202.01169, "Unified Scaling Laws for Routed Language Models" https://arxiv.org/pdf/2202.01169
  • arXiv 2512.02556, DeepSeek sparse-attention paper https://arxiv.org/pdf/2512.02556
  • Dwarkesh Podcast, "Reiner Pope: The math behind how LLMs are trained and served" (April 29, 2026) https://www.dwarkesh.com/p/reiner-pope
  • Transcript: https://gist.github.com/dwarkeshsp/79100f0fdeed69d76241903bb0604dbe
  • Older MoE context: GShard (arXiv 2006.16668), Switch Transformer (arXiv 2101.03961)
  • Chinchilla scaling laws (arXiv 2203.15556), referenced in the Pope episode

Interactive Visualization: The Sparsity Wall: What Reiner Pope Told Dwarkesh About MoE and Sparse Attention [https://podcast.do-not-panic.com/viz/2026-05-16-the-sparsity-wall-of-modern-llms-8125d4.html]

May 19, 2026 - 1 h 0 min

Affordable Large-Scale Decoding Through Model-System Co-Design

This episode explores the paper’s claim that decoding cost in large language models is driven less by raw parameter counts and more by hardware-level behavior during autoregressive generation, especially memory bandwidth pressure from the KV cache. It explains why metrics like total or activated parameters can be misleading cost proxies, and walks through the tradeoffs among standard attention, grouped-query variants, and newer approaches such as MFA that aim to preserve expressive power while reducing cache overhead. The discussion also highlights the paper’s central systems argument: attention and FFN layers have very different performance bottlenecks, so separating them through Attention-FFN Disaggregation can make large models cheaper to serve without sacrificing capability. A listener would find it interesting for its concrete, skeptical look at why inference efficiency depends on model-system co-design rather than headline model size alone.

Sources:
1. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding - StepFun: Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin Miao, Chang Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengyuan Yao, Daokuan Lv, Dapeng Shi, Deshan Sun, Ding Huang, Dingyuan Hu, Dongqing Pang, Enle Liu, Fajie Zhang, Fanqi Wan, Gulin Yan, Han Zhang, Han Zhou, Hanghao Wu, Hangyu Guo, Hanqi Chen, Hanshan Zhang, Hao Wu, Haocheng Zhang, Haolong Yan, Haoran Lv, Haoran Wei, Hebin Zhou, Heng Wang, Heng Wang, Hongxin Li, Hongyu Zhou, Hongyuan Wang, Huiyong Guo, Jia Wang, Jiahao Gong, Jialing Xie, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yan, Jie Yang, Jieyi Hou, Jinguang Zhang, Jinlan Cao, Jisheng Yin, Junfeng Liu, Junhao Huang, Junzhe Lin, Kaijun Tan, Kaixiang Li, Kang An, Kangheng Lin, Kenkun Liu, Lei Yang, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lin Zhang, Lina Chen, Liwen Huang, Liying Shi, Longlong Gu, Mei Chen, Mengqiang Ren, Ming Li, Mingzhe Chen, Na Wang, Nan Wu, Qi Han, Qian Zhao, Qiang Zhang, Qianni Liu, Qiaohui Chen, Qiling Wu, Qinglin He, Qinyuan Tan, Qiufeng Wang, Qiuping Wu, Qiuyan Liang, Quan Sun, Rui Li, Ruihang Miao, Ruosi Wan, Ruyan Guo, Shangwu Zhong, Shaoliang Pang, Shengjie Fan, Shijie Shang, Shilei Jiang, Shiliang Yang, Shiming Hao, Shuli Gao, Siming Huang, Siqi Liu, Tiancheng Cao, Tianhao Cheng, Tianhao Peng, Wang You, Wei Ji, Wen Sun, Wenjin Deng, Wenqing He, Wenzhen Zheng, Xi Chen, Xiangwen Kong, Xianzhen Luo, Xiaobo Yang, Xiaojia Liu, Xiaoxiao Ren, Xin Han, Xin Li, Xin Wu, Xu Zhao, Yanan Wei, Yang Li, Yangguang Li, Yangshijie Xu, Yanming Xu, Yaqiang Shi, Yeqing Shen, Yi Yang, Yifei Yang, Yifeng Gong, Yihan Chen, Yijing Yang, Yinmin Zhang, Yizhuang Zhou, Yuanhao Ding, Yuantao Fan, Yuanzhen Yang, Yuchu Luo, Yue Peng, Yufan Lu, Yuhang Deng, Yuhe Yin, Yujie Liu, Yukun Chen, Yuling Zhao, Yun Mou, Yunlong Li, Yunzhou Ju, Yusheng Li, Yuxiang Yang, Yuxiang Zhang, Yuyang Chen, Zejia Weng, Zhe Xie, Zheng Ge, Zheng Gong, Zhenyi Lu, Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhirui Wang, Zidong Yang, Zili Wang, Ziqi Wang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Xiangyu Zhang, 2025 http://arxiv.org/abs/2507.19427
2. Fast Transformer Decoding: One Write-Head is All You Need - Noam Shazeer, 2019 https://scholar.google.com/scholar?q=Fast+Transformer+Decoding:+One+Write-Head+is+All+You+Need
3. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints - Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023 https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints
4. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model - Zhihong Shao and DeepSeek-AI et al., 2024 https://scholar.google.com/scholar?q=DeepSeek-V2:+A+Strong,+Economical,+and+Efficient+Mixture-of-Experts+Language+Model
5. Multi-matrix Factorization Attention - Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, Heung-Yeung Shum, Daxin Jiang, 2024 https://scholar.google.com/scholar?q=Multi-matrix+Factorization+Attention
6. Splitwise: Efficient generative LLM inference using phase splitting - Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, Ricardo Bianchini, 2023 https://scholar.google.com/scholar?q=Splitwise:+Efficient+generative+LLM+inference+using+phase+splitting
7. P/D-Serve: Serving Disaggregated Large Language Model at Scale - Yibo Jin, Tao Wang, Huimin Lin and Huawei colleagues, 2024 https://scholar.google.com/scholar?q=P/D-Serve:+Serving+Disaggregated+Large+Language+Model+at+Scale
8. MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism - Ruidong Zhu, Ziheng Jiang, Chao Jin and ByteDance colleagues, 2025 https://scholar.google.com/scholar?q=MegaScale-Infer:+Serving+Mixture-of-Experts+at+Scale+with+Disaggregated+Expert+Parallelism
9. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding - StepFun et al., 2025 https://scholar.google.com/scholar?q=Step-3+is+Large+yet+Affordable:+Model-system+Co-design+for+Cost-effective+Decoding
10. DeepSeek-V3 Technical Report - DeepSeek-AI et al., 2024 https://scholar.google.com/scholar?q=DeepSeek-V3+Technical+Report
11. Qwen3 MoE 235B - Qwen Team / Alibaba researchers, 2025 https://scholar.google.com/scholar?q=Qwen3+MoE+235B
12. Prefill-Decode Disaggregation - Relevant serving-systems authors cited as [18, 31], 2024-2025 https://scholar.google.com/scholar?q=Prefill-Decode+Disaggregation
13. Kimi K2 Technical Report - Moonshot AI et al., 2025 https://scholar.google.com/scholar?q=Kimi+K2+Technical+Report
14. MiniMax M1 - MiniMax researchers, 2025 https://scholar.google.com/scholar?q=MiniMax+M1
15. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse - Jingbo Yang et al., 2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
16. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse - Yuwei An et al., 2025 https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse
17. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation - Shihao Wang et al., 2026 https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation
18. HyperAttention: Long-context Attention in Near-Linear Time - Insu Han et al., 2023 https://scholar.google.com/scholar?q=HyperAttention:+Long-context+Attention+in+Near-Linear+Time
19. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention - Tsendsuren Munkhdalai et al., 2024 https://scholar.google.com/scholar?q=Leave+No+Context+Behind:+Efficient+Infinite+Context+Transformers+with+Infini-attention
20. Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning - Ling Team et al., 2025 https://scholar.google.com/scholar?q=Every+Attention+Matters:+An+Efficient+Hybrid+Architecture+for+Long-Context+Reasoning
21. KVDirect: Distributed Disaggregated LLM Inference - Shiyang Chen et al., 2024 https://scholar.google.com/scholar?q=KVDirect:+Distributed+Disaggregated+LLM+Inference
22. HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment - Youhe Jiang et al., 2025 https://scholar.google.com/scholar?q=HexGen-2:+Disaggregated+Generative+Inference+of+LLMs+in+Heterogeneous+Environment
23. GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference - Yu Han et al., 2025 https://scholar.google.com/scholar?q=GRACE-MoE:+Grouping+and+Replication+with+Locality-Aware+Routing+for+Efficient+Distributed+MoE+Inference
24. AI Post Transformers: JANUS for Scalable MoE Inference - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-janus-for-scalable-moe-inference-78ae30.mp3
25. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3
26. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
27. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
28. AI Post Transformers: NanoFlow and the Future of LLM Serving - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-nanoflow-and-the-future-of-llm-serving-7429c9.mp3
29. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
30. AI Post Transformers: Speculative Decoding in Real vLLM Serving - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
31. AI Post Transformers: Nemotron 3 Super Hybrid Mamba-Transformer MoE - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-nemotron-3-super-hybrid-mamba-transforme-31ac75.mp3
32. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3

Interactive Visualization: Affordable Large-Scale Decoding Through Model-System Co-Design [https://podcast.do-not-panic.com/viz/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.html]

May 19, 2026 - 1 h 0 min