After Titans: Behrouz on Nested Learning and Hope

Description

The follow-up to our Titans episode, by the same core team a year later. Behrouz, Razaviyayn, Zhong, and Mirrokni (Google Research) generalize the Titans bet, that long-term memory should be a learnable module updated at test time, into a broader paradigm they call Nested Learning, in which a "deep" architecture is really a hierarchy of nested optimization problems, each compressing its own context flow.

The episode walks through their three core contributions:
(1) reframing standard optimizers such as Adam and SGD with momentum as associative-memory modules that compress gradient information, then proposing more expressive optimizers with their own deep memory;
(2) a self-modifying sequence model whose update rule is itself learned end-to-end, the natural generalization of the test-time-learnable memory module that Titans introduced;
(3) a continuum memory system that replaces the traditional short-term versus long-term dichotomy with a spectrum of memories updated at different rates.

Combining the self-modifying module with the continuum memory system produces Hope, a continual-learning architecture that the paper reports as promising on language modeling, knowledge incorporation, few-shot generalization, continual learning, and long-context reasoning. The hosts treat Hope as the next concrete instance of the Titans → Nested Learning research arc, stress-test its novelty against the established meta-learning and fast-weights literature, and distinguish what the paper actually shows from what its framing suggests.
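To make the first and third contributions more concrete, here is a minimal sketch in Python. It is not the paper's formulation: it only illustrates the well-known fact that a momentum buffer is an exponentially decayed sum of past gradients (one simple sense in which an optimizer state "compresses gradient information"), plus a toy schedule in which memory levels update at different rates. The function names, the decay value, and the update periods are all assumptions chosen for illustration.

```python
def momentum_recurrent(grads, beta):
    """Classic momentum buffer: m_t = beta * m_{t-1} + g_t, starting from 0."""
    m = 0.0
    for g in grads:
        m = beta * m + g
    return m


def momentum_unrolled(grads, beta):
    """Same buffer written as an explicit decayed sum over the gradient history."""
    t = len(grads)
    return sum(beta ** (t - 1 - i) * g for i, g in enumerate(grads))


def levels_due(step, periods):
    """Toy multi-rate schedule (hypothetical): level k updates every periods[k] steps."""
    return [k for k, period in enumerate(periods) if step % period == 0]


if __name__ == "__main__":
    grads = [1.0, -2.0, 0.5, 3.0]   # made-up scalar gradients
    beta = 0.9                      # assumed decay factor
    print(momentum_recurrent(grads, beta), momentum_unrolled(grads, beta))

    periods = (1, 4, 16)            # assumed fast, medium, and slow memory levels
    for step in (3, 8, 16):
        print(step, levels_due(step, periods))
```

The two momentum functions agree up to floating-point error, which shows that the single buffer is a lossy summary of the whole gradient history. The schedule only illustrates the idea of a range of update rates; the source does not specify the schedule Hope actually uses.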
Sources:
1. Nested Learning: The Illusion of Deep Learning Architectures. Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni (Google Research, NeurIPS 2025). arXiv 2512.24695. https://arxiv.org/pdf/2512.24695
2. Titans: Learning to Memorize at Test Time. Ali Behrouz, Peilin Zhong, Vahab Mirrokni (Jan 2025). arXiv 2501.00663. https://arxiv.org/pdf/2501.00663
3. AI Post Transformers: "Titans: Learning to Memorize at Test Time". https://podcast.do-not-panic.com/episodes/2026-05-20-titans-learning-to-memorize-at-test-time-054662.mp3

Interactive Visualization: After Titans: Behrouz on Nested Learning and Hope [https://podcast.do-not-panic.com/viz/2026-05-20-nested-learning-beyond-deep-architecture-7cc949.html]

DFX: Multi-FPGA Acceleration for Transformer Inference

This episode explores the DFX system, a four-FPGA appliance designed to accelerate transformer-based text generation by targeting a key weakness of GPUs: low-batch, token-by-token decode. It explains the difference between prompt processing and sequential generation, connects the paper’s older terminology to today’s prefill/decode framing, and shows why autoregressive inference often leaves GPU hardware underused even when training runs efficiently in parallel. The discussion also breaks down how DFX uses hardware-aware model parallelism and end-to-end accelerator design, rather than only speeding up isolated transformer subcomponents, to argue for lower latency and better energy and cost efficiency than a four-V100 GPU server. Listeners would find it interesting for its clear historical perspective on transformer serving and for its skepticism about how much of the reported advantage comes from FPGA specialization versus the fairness of the GPU baseline.

Sources:
1. DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation. Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, Joo-Young Kim, 2022. http://arxiv.org/abs/2209.10797
2. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. Yanping Huang, Youlong Cheng, Ankur Bapna, Quoc V. Le, Yonghui Wu, Zhifeng Chen, and others, 2019. https://scholar.google.com/scholar?q=GPipe:+Efficient+Training+of+Giant+Neural+Networks+using+Pipeline+Parallelism
3. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2020. https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
4. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Noam Shazeer, Zhifeng Chen, and others, 2020. https://scholar.google.com/scholar?q=GShard:+Scaling+Giant+Models+with+Conditional+Computation+and+Automatic+Sharding
5. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia, and others, 2021. https://scholar.google.com/scholar?q=Efficient+Large-Scale+Language+Model+Training+on+GPU+Clusters+Using+Megatron-LM
6. Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. https://scholar.google.com/scholar?q=Attention+Is+All+You+Need
7. FTRANS: Energy-Efficient Acceleration of Transformers using FPGA. Jingcheng Rao, Yuchen Shao, Ke Wang, Zhihao Zhu, Xuehai Qian, Yiyu Shi, 2020. https://scholar.google.com/scholar?q=FTRANS:+Energy-Efficient+Acceleration+of+Transformers+using+FPGA
8. Fast Inference from Transformers via Speculative Decoding. Yaniv Leviathan, Matan Kalman, Yossi Matias, 2022. https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
9. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference. Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao, 2024. https://scholar.google.com/scholar?q=PyramidInfer:+Pyramid+KV+Cache+Compression+for+High-throughput+LLM+Inference
10. ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference. Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Bo Li, Xuming Hu, Xiaowen Chu, 2025. https://scholar.google.com/scholar?q=ChunkKV:+Semantic-Preserving+KV+Cache+Compression+for+Efficient+Long-Context+LLM+Inference
11. Cost-Optimal Grouped-Query Attention for Long-Context LLMs. Yingfa Chen, Yutong Wu, Xu Han, Zhiyuan Liu, Maosong Sun, 2025. https://scholar.google.com/scholar?q=Cost-Optimal+Grouped-Query+Attention+for+Long-Context+LLMs
12. Optimised Grouped-Query Attention Mechanism for Transformers. Yuang Chen, Cheng Zhang, Xitong Gao, Robert D. Mullins, George A. Constantinides, Yiren Zhao, 2024. https://scholar.google.com/scholar?q=Optimised+Grouped-Query+Attention+Mechanism+for+Transformers
13. Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving. Chao Wang, Pengfei Zuo, Zhangyu Chen, Yunkai Liang, Zhou Yu, Ming-Chang Yang, 2025. https://scholar.google.com/scholar?q=Prefill-Decode+Aggregation+or+Disaggregation?+Unifying+Both+for+Goodput-Optimized+LLM+Serving
14. Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving. Xiaoxiang Shi, Colin Cai, Junjia Du, Zhihao Jia, 2025. https://scholar.google.com/scholar?q=Nexus:+Proactive+Intra-GPU+Disaggregation+of+Prefill+and+Decode+in+LLM+Serving
15. SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference. Hengrui Zhang, Pratyush Patel, August Ning, David Wentzlaff, 2025. https://scholar.google.com/scholar?q=SPAD:+Specialized+Prefill+and+Decode+Hardware+for+Disaggregated+LLM+Inference
16. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
17. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3
18. AI Post Transformers: LAPS for Length-Aware LLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3
19. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3
20. AI Post Transformers: Speculative Decoding in Real vLLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
21. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
22. AI Post Transformers: Caffeine: A Unified FPGA for CNNs. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-06-caffeine-a-unified-fpga-for-cnns-e8acbe.mp3

Interactive Visualization: DFX: Multi-FPGA Acceleration for Transformer Inference [https://podcast.do-not-panic.com/viz/2026-05-22-dfx-multi-fpga-acceleration-for-transfor-3266ea.html]
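The prefill/decode gap described in this episode can be illustrated with a back-of-the-envelope calculation. The sketch below is a simplified roofline-style estimate for a single square weight matrix, not a model of DFX or of any GPU: it counts floating-point operations per byte moved when the same weights are applied to many prompt tokens at once (prefill) versus one token at a time (decode). The function name, the model width of 4096, the 16-bit element size, and the 512-token prompt are assumptions for illustration, and attention and the KV cache are ignored.

```python
def arithmetic_intensity(n_tokens, d_model, bytes_per_elem=2):
    """FLOPs per byte for applying one d_model x d_model weight matrix to n_tokens.

    Simplified model (assumption): the weights are read once, and the input and
    output activations are each moved once. Attention and KV cache are ignored.
    """
    flops = 2 * n_tokens * d_model * d_model          # one multiply and one add per weight per token
    weight_bytes = bytes_per_elem * d_model * d_model
    activation_bytes = bytes_per_elem * 2 * n_tokens * d_model
    return flops / (weight_bytes + activation_bytes)


if __name__ == "__main__":
    d_model = 4096                                    # assumed hidden size
    prefill = arithmetic_intensity(512, d_model)      # whole prompt processed in one pass
    decode = arithmetic_intensity(1, d_model)         # one generated token per pass
    print(f"prefill: {prefill:.1f} FLOPs/byte")
    print(f"decode:  {decode:.3f} FLOPs/byte")
```

Under these assumptions, decode performs roughly one operation per byte of weights read, while prefill performs hundreds. That is the usual explanation for why batch-1 generation tends to be limited by memory bandwidth rather than compute, which is the regime the episode says DFX targets.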

27 May 2026, 1 h 0 min
