AI Papers: A Deep Dive
HOW A FIFTEEN-HUNDRED-DOLLAR TRAINING RUN MATCHED LLAMA AND GEMMA ON REASONING

Source: HRM-Text: Efficient Pretraining Beyond Scaling [https://arxiv.org/abs/2605.20613]
The paper was published on May 20, 2026.

This episode was AI-generated on May 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A team at Sapient Intelligence and MIT trained a 1B-parameter model on 16 GPUs in 46 hours for about $1,500, and it goes toe-to-toe with Llama, Qwen, Gemma, and OLMo on math and reasoning benchmarks. The authors argue this isn't just a democratization story: it's evidence that the trillion-token pretraining race was solving a problem that better architecture and a smarter objective could have partly avoided.

KEY TAKEAWAYS
* Why standard Transformers waste most of their depth, and how HRM-Text's fast/slow recurrent modules (L runs 3x for every H update, twice per forward pass) actually keep deliberating through the final layer
* The MagicNorm trick: how a single placement of normalization behaves like PreNorm on the backward pass and PostNorm on the forward pass, because the two horizons have different lengths
* Why grading the model only on response tokens, not on the question, concentrates the gradient signal and jumps MMLU from 40 to 48 with no other changes
* How PrefixLM attention lets the model read the prompt freely while still generating answers one token at a time, adding another 5 points on MMLU
* Three honest pushbacks: HRM-Text is trained directly on instruction-response pairs (not apples-to-apples with general foundation models), the curated data mixture isn't isolated in the ablation, and scaling beyond 1B parameters is unverified
* Why the right frame is 'existence proof, not new paradigm': the compute-to-performance ratio isn't a law of nature, and architectural questions are accessible to small labs again

CHAPTERS
* 00:00 - The fifteen-hundred-dollar headline
  The setup: a 1B model trained for $1,500 matches models that cost 100-900x more, and why the two assumptions baked into standard pretraining make that possible.
* 02:38 - The H and L modules: fast and slow deliberation
  How HRM-Text borrows the frontoparietal loop's fast-execution/slow-strategy split and reuses weights recurrently instead of stacking more layers.
* 05:16 - MagicNorm and the asymmetric tightrope
  Why recurrent models are notoriously hard to train, and the clever normalization placement that exploits the gap between an 8-step forward pass and a truncated backward pass.
* 07:54 - Stop grading the model on the question
  The exam-grader analogy: why computing loss only on response tokens, not the prompt, concentrates gradient signal where it matters.
* 10:32 - PrefixLM: reading freely, writing causally
  How letting the question tokens see each other bidirectionally while keeping answer generation causal gives encoder-like reading behavior without a second model.
* 13:10 - The logit lens test: is the recurrence doing real work?
  Evidence that, unlike standard Transformers, which lock in predictions early, HRM-Text's recurrent cycles keep meaningfully updating the answer to the end.
* 15:49 - Three honest pushbacks
  Not apples-to-apples comparisons, uncontrolled data curation, and unverified scaling: what the headline numbers do and don't justify.
* 18:27 - What survives the critique
  Why the narrower claim, that current pretraining leaves enormous efficiency on the table, holds, and what it means for who gets to do architecture research.

RECOMMENDED READING
* Universal Transformers [https://arxiv.org/abs/1807.03819]: The classic recurrent-Transformer paper that established the 'reuse the same block many times' idea HRM-Text builds on with its fast/slow split.
* Looped Transformers as Programmable Computers [https://arxiv.org/abs/2301.13196]: A more recent treatment of looped/recurrent Transformers that sharpens the case Bella makes for getting more computation per parameter.
* Scaling Laws for Neural Language Models (Kaplan et al.) [https://arxiv.org/abs/2001.08361]: The foundational scaling-laws paper whose 'just add tokens and parameters' worldview HRM-Text is implicitly arguing against.
* Training Compute-Optimal Large Language Models (Chinchilla) [https://arxiv.org/abs/2203.15556]: The other half of the scaling-orthodoxy story, and useful context for evaluating the episode's claim that the trillion-token race left efficiency on the table.
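
A ROUGH SKETCH OF THE MECHANICS

For listeners who think in code, here is a minimal, hedged sketch of three ideas from the takeaways: the fast/slow update schedule, the PrefixLM attention mask, and response-only loss. It is not the paper's implementation. The function names, the ordering of L steps before each H step, and the convention that token_losses[i] is the loss for predicting token i are all assumptions made for illustration.

```python
def recurrence_schedule(cycles=2, l_steps_per_h=3):
    """Order of module updates in one forward pass.

    Assumption: each cycle runs the fast L module l_steps_per_h times,
    then the slow H module once. With the episode's numbers (3 L steps
    per H update, 2 cycles) this gives 8 steps.
    """
    steps = []
    for _ in range(cycles):
        steps.extend(["L"] * l_steps_per_h)
        steps.append("H")
    return steps


def prefix_lm_mask(prompt_len, total_len):
    """mask[i][j] is True when position i may attend to position j.

    Prompt positions see the whole prompt (bidirectional reading);
    response positions see the prompt plus earlier response positions
    (causal writing). Prompt positions never see the response.
    """
    return [
        [j < prompt_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]


def response_only_loss(token_losses, prompt_len):
    """Average per-token loss over response tokens only.

    Assumption: token_losses[i] is the loss for predicting token i,
    so indices >= prompt_len belong to the response.
    """
    response = token_losses[prompt_len:]
    if not response:
        raise ValueError("sequence has no response tokens")
    return sum(response) / len(response)


if __name__ == "__main__":
    print(recurrence_schedule())
    for row in prefix_lm_mask(prompt_len=2, total_len=4):
        print(["x" if allowed else "." for allowed in row])
    print(response_only_loss([5.0, 5.0, 1.0, 3.0], prompt_len=2))
```

In a real training loop the same effect is usually achieved by masking prompt positions out of the loss rather than slicing a list, but the arithmetic is the same: prompt tokens contribute nothing to the gradient.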