Opaque Serial Depth and Chain-of-Thought Limits

Description

This episode explores a 2026 paper on whether language models can keep doing long, step-by-step reasoning internally or eventually need to expose some of that reasoning in visible chain-of-thought tokens. It explains the paper’s core idea of opaque serial depth, or a model’s hidden reasoning horizon, and argues that this is a better safety-relevant measure than raw model size, parameter count, or informal layer counting. The discussion connects that metric to circuit complexity, fixed-precision computation, and transformer internals, showing why models can perform huge amounts of parallel work in one pass yet still face structural limits on long private sequential reasoning. Listeners would find it interesting because it sharpens a major AI safety question: whether monitoring visible reasoning can meaningfully constrain powerful models, and where that hope may break down.

Sources:
1. Opaque Serial Depth and Chain-of-Thought Limits https://arxiv.org/pdf/2603.09786
2. Quantifying the Necessity of Chain of Thought through Opaque Serial Depth. Jonah Brown-Cohen, David Lindner, Rohin Shah, 2026 https://scholar.google.com/scholar?q=Quantifying+the+Necessity+of+Chain+of+Thought+through+Opaque+Serial+Depth
3. Chain of Thought Empowers Transformers to Solve Inherently Serial Problems. Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma, 2024 https://scholar.google.com/scholar?q=Chain+of+Thought+Empowers+Transformers+to+Solve+Inherently+Serial+Problems
4. Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Mark Chen, Rohin Shah, et al., 2025 https://scholar.google.com/scholar?q=Chain+of+Thought+Monitorability:+A+New+and+Fragile+Opportunity+for+AI+Safety
5. When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors. Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah, 2025 https://scholar.google.com/scholar?q=When+Chain+of+Thought+is+Necessary,+Language+Models+Struggle+to+Evade+Monitors
6. Saturated Transformers are Constant-Depth Threshold Circuits. William Merrill, Ashish Sabharwal, Noah A. Smith, 2022 https://scholar.google.com/scholar?q=Saturated+Transformers+are+Constant-Depth+Threshold+Circuits
7. The Parallelism Tradeoff: Limitations of Log-Precision Transformers. William Merrill, Ashish Sabharwal, 2023 https://scholar.google.com/scholar?q=The+Parallelism+Tradeoff:+Limitations+of+Log-Precision+Transformers
8. The Expressive Power of Transformers with Chain of Thought. William Merrill, Ashish Sabharwal, 2024 https://scholar.google.com/scholar?q=The+Expressive+Power+of+Transformers+with+Chain+of+Thought
9. Theoretical Limitations of Self-Attention in Neural Sequence Models. Michael Hahn, 2020 https://scholar.google.com/scholar?q=Theoretical+Limitations+of+Self-Attention+in+Neural+Sequence+Models
10. Continuous Chain of Thought Enables Parallel Exploration and Reasoning. Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak, 2025 https://scholar.google.com/scholar?q=Continuous+Chain+of+Thought+Enables+Parallel+Exploration+and+Reasoning
11. Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought. Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo, 2026 https://scholar.google.com/scholar?q=Reasoning+Theater:+Disentangling+Model+Beliefs+from+Chain-of-Thought
12. Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps. Martin Tutek et al., 2025 https://scholar.google.com/scholar?q=Measuring+Chain+of+Thought+Faithfulness+by+Unlearning+Reasoning+Steps
13. Counterfactual Simulation Training for Chain-of-Thought Faithfulness. Peter Hase, Christopher Potts, 2026 https://scholar.google.com/scholar?q=Counterfactual+Simulation+Training+for+Chain-of-Thought+Faithfulness
14. Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models. Richard J. Young, 2026 https://scholar.google.com/scholar?q=Why+Models+Know+But+Don't+Say:+Chain-of-Thought+Faithfulness+Divergence+Between+Thinking+Tokens+and+Answers+in+Open-Weight+Reasoning+Models
15. Reasoning with Latent Thoughts: On the Power of Looped Transformers. Nikunj Saunshi et al., 2025 https://scholar.google.com/scholar?q=Reasoning+with+Latent+Thoughts:+On+the+Power+of+Looped+Transformers
16. Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding. Haolin Chen et al., 2024 https://scholar.google.com/scholar?q=Language+Models+are+Hidden+Reasoners:+Unlocking+Latent+Reasoning+Capabilities+via+Self-Rewarding
17. Efficient Post-Training Refinement of Latent Reasoning in Large Language Models. Xinyuan Wang et al., 2025 https://scholar.google.com/scholar?q=Efficient+Post-Training+Refinement+of+Latent+Reasoning+in+Large+Language+Models
18. AI Post Transformers: Reasoning Theater and Unfaithful Chain-of-Thought. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-reasoning-theater-and-unfaithful-chain-o-a4507e.mp3
19. AI Post Transformers: Generative Recursive Reasoning in Latent Space. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-21-generative-recursive-reasoning-in-latent-a9371d.mp3
20. AI Post Transformers: How Models Detect Hidden Activation Steering. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-08-how-models-detect-hidden-activation-stee-577f73.mp3
21. AI Post Transformers: Latent Space as a New Computational Paradigm. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-05-latent-space-as-a-new-computational-para-810f39.mp3
22. AI Post Transformers: Neural Computers as Learned Latent Runtimes. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-11-neural-computers-as-learned-latent-runti-9fa282.mp3

Interactive Visualization: Opaque Serial Depth and Chain-of-Thought Limits [https://podcast.do-not-panic.com/viz/2026-06-03-opaque-serial-depth-and-chain-of-thought-e07fc1.html]

Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode

This episode explores why batch-1 LLM decode for robots, edge copilots, and other single-session agents behaves very differently from high-throughput serving, and why next-token latency cannot be explained by memory bandwidth alone. It breaks down the paper’s main test: compare measured decode time against an analytic memory floor based on model-weight and KV-cache traffic, then repeat that comparison across Qwen-2.5-7B, Mistral-7B-v0.3, and Llama-3.1-8B on L4, L40S, A100, and H100 GPUs at context lengths from 2048 to 16384. The discussion argues that because these models already use grouped-query attention to cut KV traffic, the remaining latency gap is driven by runtime details such as CUDA Graphs, launch overhead, kernel quality, and whether quantization actually helps in this tiny decode regime. Listeners would find it interesting because it challenges the simple idea that buying a GPU with faster memory automatically lowers token latency, especially for physical AI systems where one delayed token can stall the whole interaction.

Sources:
1. Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode https://arxiv.org/pdf/2605.30571
2. Orca: A Distributed Serving System for Transformer-Based Generative Models. Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022 https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models
3. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
4. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, Ricardo Bianchini, 2024 https://scholar.google.com/scholar?q=Splitwise:+Efficient+Generative+LLM+Inference+Using+Phase+Splitting
5. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2024 https://scholar.google.com/scholar?q=Mooncake:+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving
6. Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs. Jonah Ekelund, Stefano Markidis, Ivy Peng, 2025 https://scholar.google.com/scholar?q=Boosting+Performance+of+Iterative+Applications+on+GPUs:+Kernel+Batching+with+CUDA+Graphs
7. PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch. Abhishek Ghosh, Ajay Nayak, Ashish Panwar, Arkaprava Basu, 2025 https://scholar.google.com/scholar?q=PyGraph:+Robust+Compiler+Support+for+CUDA+Graphs+in+PyTorch
8. Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start. Xueshen Liu, Yongji Wu, Yuncheng Yao, Danyang Zhuo, Ion Stoica, Z. Morley Mao, 2026 https://scholar.google.com/scholar?q=Foundry:+Template-Based+CUDA+Graph+Context+Materialization+for+Fast+LLM+Serving+Cold+Start
9. Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode. Josef Chen, 2026 https://scholar.google.com/scholar?q=Memory-Bound+but+Not+Bandwidth-Limited:+The+Physical+AI+Inference+Gap+in+Batch-1+LLM+Decode
10. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023 https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints
11. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh, 2022 https://scholar.google.com/scholar?q=GPTQ:+Accurate+Post-Training+Quantization+for+Generative+Pre-trained+Transformers
12. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. Ji Lin et al., 2023 https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+LLM+Compression+and+Acceleration
13. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, 2024 https://scholar.google.com/scholar?q=FlashAttention-3:+Fast+and+Accurate+Attention+with+Asynchrony+and+Low-precision
14. FlashDecoding++: Faster Large Language Model Inference on GPUs. Ke Hong et al., 2023 https://scholar.google.com/scholar?q=FlashDecoding++:+Faster+Large+Language+Model+Inference+on+GPUs
15. Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference. Pol G. Recasens et al., 2025 https://scholar.google.com/scholar?q=Mind+the+Memory+Gap:+Unveiling+GPU+Bottlenecks+in+Large-Batch+LLM+Inference
16. Challenges and Research Directions for Large Language Model Inference Hardware. Xiaoyu Ma, David Patterson, 2026 https://scholar.google.com/scholar?q=Challenges+and+Research+Directions+for+Large+Language+Model+Inference+Hardware
17. Medusa: Accelerating Serverless LLM Inference with Materialization. Shaoxun Zeng et al., 2025 https://scholar.google.com/scholar?q=Medusa:+Accelerating+Serverless+LLM+Inference+with+Materialization
18. Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference. Divakar Kumar Yadav and Tian Zhao, 2026 https://scholar.google.com/scholar?q=Hybrid+JIT-CUDA+Graph+Optimization+for+Low-Latency+Large+Language+Model+Inference
19. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Ji Lin et al., 2024 https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+On-Device+LLM+Compression+and+Acceleration
20. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. Tim Dettmers et al., 2024 https://scholar.google.com/scholar?q=SpQR:+A+Sparse-Quantized+Representation+for+Near-Lossless+LLM+Weight+Compression
21. Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs. Sayed Pedram Haeri Boroujeni et al., 2026 https://scholar.google.com/scholar?q=Don't+Waste+Bits!+Adaptive+KV-Cache+Quantization+for+Lightweight+On-Device+LLMs
22. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse. Jingbo Yang et al., 2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
23. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. Yuhan Liu, Yihua Cheng et al., 2025 https://scholar.google.com/scholar?q=LMCache:+An+Efficient+KV+Cache+Layer+for+Enterprise-Scale+LLM+Inference
24. Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference. Kexin Chu et al., 2025 https://scholar.google.com/scholar?q=Selective+KV-Cache+Sharing+to+Mitigate+Timing+Side-Channels+in+LLM+Inference
25. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
26. AI Post Transformers: Speculative Decoding in Real vLLM Serving. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
27. AI Post Transformers: LPU Chip for Low-Latency LLM Inference. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-lpu-chip-for-low-latency-llm-inference-be13c3.mp3
28. AI Post Transformers: CXL Computational Memory Offloading for Lower Runtime. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-cxl-computational-memory-offloading-for-3b2124.mp3
29. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
30. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3

June 2, 2026, 1 h 0 min
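The batch-1 decode episode above turns on comparing measured per-token time with an analytic memory floor built from model-weight and KV-cache traffic. The paper's exact formula is not reproduced in these notes, so the following is only a minimal sketch under a common assumption: each decode step reads every weight once plus the whole KV cache, and the floor is those bytes divided by peak memory bandwidth. The class and function names, the 2-byte element sizes, and the round model numbers are all illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeConfig:
    """Hypothetical description of a decoder-only model (illustrative fields)."""
    n_params: float                # total weight parameters
    n_layers: int                  # transformer layers
    n_kv_heads: int                # KV heads (fewer than query heads under GQA)
    head_dim: int                  # dimension per attention head
    bytes_per_weight: float = 2.0  # assumed 16-bit weights
    bytes_per_kv: float = 2.0      # assumed 16-bit KV cache


def kv_bytes_per_token(cfg: DecodeConfig) -> float:
    # One K and one V vector per KV head, per layer, for each cached token.
    return 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * cfg.bytes_per_kv


def memory_floor_seconds(cfg: DecodeConfig, context_len: int,
                         bandwidth_bytes_per_s: float) -> float:
    # Assumed traffic for one decode step: all weights plus the full KV cache.
    weight_bytes = cfg.n_params * cfg.bytes_per_weight
    kv_bytes = kv_bytes_per_token(cfg) * context_len
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s


def floor_gap(measured_seconds: float, floor_seconds: float) -> float:
    # Ratio > 1 means the measured step is slower than pure memory traffic allows.
    return measured_seconds / floor_seconds


if __name__ == "__main__":
    # Round, illustrative numbers for an 8B-class GQA model and a 1 TB/s GPU.
    cfg = DecodeConfig(n_params=8e9, n_layers=32, n_kv_heads=8, head_dim=128)
    floor = memory_floor_seconds(cfg, context_len=2048, bandwidth_bytes_per_s=1e12)
    print(f"memory floor: {floor * 1e3:.2f} ms per token")
    print(f"gap at a measured 20 ms step: {floor_gap(0.020, floor):.2f}x")
```

With these assumed numbers the floor is about 16.3 ms per token, almost all of it weight traffic. That is consistent with the episode's point that, once grouped-query attention has shrunk the KV term, any remaining gap between measured time and the floor has to come from something other than memory bandwidth.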

SmolLM2 and the Power of Better Data

This episode explores SmolLM2, a 1.7-billion-parameter language model from Hugging Face that tries to compete with stronger small models not by changing the transformer architecture, but by radically improving the training data mix and its sequencing across roughly 11 trillion tokens. It explains the distinction between pretraining and instruction tuning, then argues that for compact models, dataset quality and curriculum can function almost like part of the architecture itself. The discussion connects SmolLM2 to earlier work such as Chinchilla, TinyStories, Textbooks Are All You Need, FineWeb-Edu, and DataComp-LM to show why educational web text, curated math and code data, and staged rebalancing matter so much when model capacity is tight. Listeners would find it interesting because it frames a practical question with real deployment stakes: whether careful data design can make smaller, cheaper, lower-latency models genuinely useful without relying on giant-scale compute.

Sources:
1. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model. Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf, 2025 http://arxiv.org/abs/2502.02737
2. Training Compute-Optimal Large Language Models. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Nalisnick, Daniel Yamins, Timothy Lillicrap, Oriol Vinyals, Jeff Dean, et al., 2022 https://scholar.google.com/scholar?q=Training+Compute-Optimal+Large+Language+Models
3. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? Ronen Eldan, Yuanzhi Li, 2023 https://scholar.google.com/scholar?q=TinyStories:+How+Small+Can+Language+Models+Be+and+Still+Speak+Coherent+English?
4. Textbooks Are All You Need. Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio C. T. Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sebastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li, 2023 https://scholar.google.com/scholar?q=Textbooks+Are+All+You+Need
5. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 2024 https://scholar.google.com/scholar?q=MobileLLM:+Optimizing+Sub-billion+Parameter+Language+Models+for+On-Device+Use+Cases
6. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. Luca Soldaini, Rodney Kinney, Dustin Schwenk, Siddharth Goyal, Alessandro Sordoni, Kyle Lo, Noah A. Smith, and collaborators, 2024 https://scholar.google.com/scholar?q=Dolma:+an+Open+Corpus+of+Three+Trillion+Tokens+for+Language+Model+Pretraining+Research
7. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf, 2024 https://scholar.google.com/scholar?q=The+FineWeb+Datasets:+Decanting+the+Web+for+the+Finest+Text+Data+at+Scale
8. Data-Centric AI in the Age of Large Language Models. Xinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan Kian Hsiang Low, 2024 https://scholar.google.com/scholar?q=Data-Centric+AI+in+the+Age+of+Large+Language+Models
9. The Stack: 3 TB of permissively licensed source code. Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries, 2022 https://scholar.google.com/scholar?q=The+Stack:+3+TB+of+permissively+licensed+source+code
10. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. Ning Ding, Yulin Chen, Bokai Xu, et al., 2023 https://scholar.google.com/scholar?q=Enhancing+Chat+Language+Models+by+Scaling+High-quality+Instructional+Conversations
11. OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data. Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman, 2024 https://scholar.google.com/scholar?q=OpenMathInstruct-2:+Accelerating+AI+for+Math+with+Massive+Open-Source+Instruction+Data
12. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model. Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf, 2025 https://scholar.google.com/scholar?q=SmolLM2:+When+Smol+Goes+Big+--+Data-Centric+Training+of+a+Small+Language+Model
13. DataComp-LM: In search of the next generation of training sets for language models. Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, and many others, 2024 https://scholar.google.com/scholar?q=DataComp-LM:+In+search+of+the+next+generation+of+training+sets+for+language+models
14. OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text. Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, Jimmy Ba, 2023 https://scholar.google.com/scholar?q=OpenWebMath:+An+Open+Dataset+of+High-Quality+Mathematical+Web+Text
15. InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning. Xiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, Qihang Fan, Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, Quanzeng You, 2024 https://scholar.google.com/scholar?q=InfiMM-WebMath-40B:+Advancing+Multimodal+Pre-Training+for+Enhanced+Mathematical+Reasoning
16. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo, 2024 https://scholar.google.com/scholar?q=DeepSeekMath:+Pushing+the+Limits+of+Mathematical+Reasoning+in+Open+Language+Models
17. 2 OLMo 2 Furious. Kyle Lo and the OLMo team, 2025 https://scholar.google.com/scholar?q=2+OLMo+2+Furious
18. Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies. Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang, 2025 https://scholar.google.com/scholar?q=Revisiting+Scaling+Laws+for+Language+Models:+The+Role+of+Data+Quality+and+Training+Strategies
19. GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining. Simin Fan, Maria Ios Glarou, Martin Jaggi, 2025 https://scholar.google.com/scholar?q=GRAPE:+Optimize+Data+Mixture+for+Group+Robust+Multi-target+Adaptive+Pretraining
20. Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models. Lior Belenki, Alekh Agarwal, Tianze Shi, Kristina Toutanova, 2025 https://scholar.google.com/scholar?q=Optimizing+Pre-Training+Data+Mixtures+with+Mixtures+of+Data+Expert+Models
21. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies. Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong, 2024 https://scholar.google.com/scholar?q=Scaling+Laws+with+Vocabulary:+Larger+Models+Deserve+Larger+Vocabularies
22. Distilling Reasoning Capabilities into Smaller Language Models. Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan, 2023 https://scholar.google.com/scholar?q=Distilling+Reasoning+Capabilities+into+Smaller+Language+Models
23. Teaching Small Language Models Reasoning through Counterfactual Distillation. Tao Feng, Yicheng Li, Chenglin Li, Hao Chen, Fei Yu, Yin Zhang, 2024 https://scholar.google.com/scholar?q=Teaching+Small+Language+Models+Reasoning+through+Counterfactual+Distillation
24. AI Post Transformers: Self-Improving Pretraining With Post-Trained Models. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-02-self-improving-pretraining-with-post-tra-e37460.mp3
25. AI Post Transformers: Scaling Laws for Multilingual Code Pretraining. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-scaling-laws-for-multilingual-code-pretr-7d220e.mp3
26. AI Post Transformers: Can Models Learn from Long Context? Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-can-models-learn-from-long-context-77533e.mp3
27. AI Post Transformers: ASI-Evolve for Data, Architectures, and RL. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-05-asi-evolve-for-data-architectures-and-rl-197b2b.mp3
28. AI Post Transformers: Muon Is Scalable for LLM Training. Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-muon-is-scalable-for-llm-training-587ed8.mp3

June 1, 2026, 1 h 0 min
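The SmolLM2 episode above describes staged rebalancing of the training mix, where the share of web, code, and math data shifts as training progresses. A rough way to picture that mechanism is a stage schedule that maps training progress to sampling weights. This is only a sketch: the three stages, the source names, and every weight below are hypothetical placeholders and do not reflect SmolLM2's actual mixture.

```python
import random

# Hypothetical schedule: (training progress at which the stage ends, weights).
# Source names and all numbers are illustrative assumptions, not SmolLM2's mix.
STAGES = [
    (0.60, {"web_edu": 0.90, "code": 0.10, "math": 0.00}),
    (0.90, {"web_edu": 0.75, "code": 0.20, "math": 0.05}),
    (1.00, {"web_edu": 0.60, "code": 0.25, "math": 0.15}),
]


def mixture_at(progress: float) -> dict[str, float]:
    """Return the sampling weights for a training progress value in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    for stage_end, weights in STAGES:
        if progress <= stage_end:
            return weights
    return STAGES[-1][1]


def sample_source(progress: float, rng: random.Random) -> str:
    """Pick which data source the next training document is drawn from."""
    weights = mixture_at(progress)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for progress in (0.10, 0.70, 0.95):
        draws = [sample_source(progress, rng) for _ in range(1000)]
        counts = {name: draws.count(name) for name in mixture_at(progress)}
        print(progress, counts)
```

Keeping the schedule as plain data makes the curriculum easy to inspect and change without touching the sampling code, which fits the episode's framing of the data mix as something close to a design decision about the model itself.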
