AI Post Transformers
This episode explores TRELLIS, a bounded-memory transformer architecture that replaces the usual ever-growing key-value cache with a fixed set of learned memory slots that are rewritten during inference. It explains why long-context serving is constrained less by training-time quadratic attention than by the linear growth, latency, and fragility of KV caches, and situates TRELLIS in the progression from Transformer-XL and Compressive Transformers to ABC and GSA. The discussion highlights TRELLIS’s central idea: treating memory as fast weights for a small online regression layer, updating that memory with test-time gradient descent and state decay so the model can reconstruct useful representations while learning what to forget. Listeners would find it interesting because it connects deployment pain points in modern LLMs to a concrete alternative architecture that aims to preserve quality even as context grows while memory stays fixed.

Sources:
1. TRELLIS and Bounded-Memory Transformer KV Compression. https://arxiv.org/pdf/2512.23852
2. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, 2019. https://scholar.google.com/scholar?q=Transformer-XL:+Attentive+Language+Models+Beyond+a+Fixed-Length+Context
3. Compressive Transformers for Long-Range Sequence Modelling. Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap, 2020. https://scholar.google.com/scholar?q=Compressive+Transformers+for+Long-Range+Sequence+Modelling
4. Recurrent Memory Transformer. Aydar Bulatov, Yury Kuratov, Mikhail Burtsev, 2022. https://scholar.google.com/scholar?q=Recurrent+Memory+Transformer
5. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, 2024. https://scholar.google.com/scholar?q=Leave+No+Context+Behind:+Efficient+Infinite+Context+Transformers+with+Infini-attention
6. ABC: Attention with Bounded-Memory Control. Hao Peng et al., 2021. https://scholar.google.com/scholar?q=ABC:+Attention+with+Bounded-Memory+Control
7. Gated Slot Attention for Efficient Linear-Time Sequence Modeling. Yu Zhang et al., 2024. https://scholar.google.com/scholar?q=Gated+Slot+Attention+for+Efficient+Linear-Time+Sequence+Modeling
8. Learning to (Learn at Test Time): RNNs with Expressive Hidden States. Yu Sun et al., 2024. https://scholar.google.com/scholar?q=Learning+to+(Learn+at+Test+Time):+RNNs+with+Expressive+Hidden+States
9. Lattice: Learning to Efficiently Compress the Memory. Mahdi Karami, Razvan Pascanu, Vahab Mirrokni, 2025. https://scholar.google.com/scholar?q=Lattice:+Learning+to+Efficiently+Compress+the+Memory
10. You Only Cache Once: Decoder-Decoder Architectures for Language Models. Yutao Sun et al., 2024. https://scholar.google.com/scholar?q=You+Only+Cache+Once:+Decoder-Decoder+Architectures+for+Language+Models
11. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse. Jingbo Yang et al., 2025. https://arxiv.org/abs/2502.16002
12. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. Yihua Cheng et al., 2025. https://arxiv.org/abs/2510.09665
13. Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference. Kexin Chu et al., 2025. https://arxiv.org/abs/2508.08438
14. SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning. Huanxuan Liao et al., 2025. https://arxiv.org/abs/2508.15212
15. Test-Time Training Provably Improves Transformers as In-context Learners. Halil Alperen Gozeten et al., 2025. https://arxiv.org/abs/2503.11842
16. Linearizing Vision Transformer with Test-Time Training. Yining Li et al., 2026. https://arxiv.org/abs/2605.02772
17. AI Post Transformers: Titans: Learning to Memorize at Test Time. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-20-titans-learning-to-memorize-at-test-time-054662.mp3
18. AI Post Transformers: Explicit Information Transmission for Context Compression. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3
19. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3
20. AI Post Transformers: In-Place Test-Time Training for Transformers. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3
21. AI Post Transformers: Parallelizing DeltaNet Linear Transformers over Sequence Length. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-18-parallelizing-deltanet-linear-transforme-2d0377.mp3
22. AI Post Transformers: Long Context Pre-Training with Lighthouse Attention. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-13-long-context-pre-training-with-lighthous-e85bbe.mp3
23. AI Post Transformers: Compressed Convolutional Attention in Latent Space. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-25-compressed-convolutional-attention-in-la-61e1cf.mp3

Interactive Visualization: TRELLIS and Bounded-Memory Transformer KV Compression https://podcast.do-not-panic.com/viz/2026-06-02-trellis-and-bounded-memory-transformer-k-81f237.html
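
For readers who want a concrete picture of the idea in the episode description (memory as fast weights for a small online regression layer, updated by test-time gradient descent with state decay), here is a minimal sketch in plain Python. It is not the update rule from the TRELLIS paper, whose exact form these notes do not give. It assumes a single linear memory matrix W, a squared reconstruction loss 0.5 * ||W k - v||^2, and fixed scalar learning rate and decay. The names update_memory, run, lr, and decay are hypothetical and chosen only for illustration.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def update_memory(W, k, v, lr=0.5, decay=0.99):
    """One gradient step on 0.5 * ||W k - v||^2, with multiplicative decay.

    Assumption: a plain linear regression memory. The gradient of the loss
    with respect to W is the outer product (W k - v) k^T.
    """
    err = [p - t for p, t in zip(matvec(W, k), v)]  # prediction error W k - v
    return [
        [decay * w - lr * e * kj for w, kj in zip(row, k)]
        for row, e in zip(W, err)
    ]


def run(num_tokens, d_k=4, d_v=3, seed=0):
    """Stream random (key, value) pairs through a fixed-size memory."""
    rng = random.Random(seed)
    W = [[0.0] * d_k for _ in range(d_v)]  # memory size never changes
    for _ in range(num_tokens):
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        norm = sum(x * x for x in k) ** 0.5
        k = [x / norm for x in k]  # unit-norm key keeps the step well scaled
        v = [rng.gauss(0, 1) for _ in range(d_v)]
        W = update_memory(W, k, v)
    return W


if __name__ == "__main__":
    short = run(10)
    long = run(1000)
    # Both memories have the same shape, unlike a KV cache that grows per token.
    print(len(short), len(short[0]), len(long), len(long[0]))
```

The point of the sketch is the contrast with a KV cache: the state after 10 tokens and after 1000 tokens has the same shape, and the decay term is what lets older associations fade as new ones are written.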
672 episodes