When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions

Description

Source: Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most [https://arxiv.org/abs/2605.22672]
Paper was published on May 21, 2026.
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Claude Opus 4.6 looked at Brazil's 1986 hyperinflation, correctly named the regime-change risk, then forecast a number seven million times too high. A new paper shows this isn't a fluke: it's a structural pattern across epidemics, housing bubbles, and decades of measles data, and the standard way the field grades LLM forecasts can't see it.

KEY TAKEAWAYS
* Why the same model outputs can earn opposite verdicts (capable models look best under Brier-style scoring and worst under CRPS, the continuous ranked probability score), and what that means for every existing LLM forecasting benchmark
* The specific trigger for the inversion: superlinear growth followed by a regime change, confirmed by a clean linear-growth control where the effect vanishes entirely
* A within-family Llama experiment showing scale and post-training each independently make the overcommitment worse, and compound when combined
* The unselected pre-vaccine US measles cohort (1,339 state-seasons) that rules out the 'you cherry-picked the crashes' objection, plus flu as a pre-registered negative control
* Why naming the historical episode rescues calibration for COVID and housing but fails completely for hyperinflation: the knowledge is in the model, but it doesn't reach the tails
* The one-line fix: report a tail-integrating proper scoring rule alongside threshold metrics, using forecasts benchmarks have already collected

* 00:00 - The Opus 4.6 hyperinflation moment
  A frontier model articulates the regime-change possibility, weighs it on the page, and then commits to extrapolation anyway, overshooting reality by a factor of seven million.
* 03:47 - Two ways to grade a forecast
  Distributional forecasts, threshold-based Brier scoring versus tail-integrating CRPS, and the weather-map analogy that makes the asymmetry click.
* 07:35 - The Freeciv benchmark and the first crack
  A clean, unseen forecasting setup where binary and continuous versions of the same question yield opposite capability-accuracy correlations at long horizons.
* 11:22 - The synthetic epidemic and its linear-growth control
  An exponential-then-crash simulator reproduces the inversion, and swapping in linear growth makes it vanish, pinning the mechanism to the bend-then-break shape.
* 15:10 - Competence-driven overcommitment
  Per-quantile decomposition shows the lower tail stays flat while the upper tail balloons with capability, and the within-family Llama 2x2 confirms scale and post-training each contribute.
* 18:57 - Real-world replications and the measles test
  COVID, housing, and hyperinflation replicate the pattern, but the unselected measles cohort and a pre-registered flu negative control are what make the result hard to dismiss.
* 22:45 - The verdict flip and what the model knows
  The same forecasts graded two ways reverse the sign of the capability correlation, and a knowledge probe reveals models can name the crisis they're forecasting yet still produce extreme tail overshoots.
* 26:33 - Limitations, the steelman, and the fix
  Honest pushback on the capability axis, the bundled post-training treatment, and the small hyperinflation sample, followed by the embarrassingly simple methodological recommendation.

RECOMMENDED READING
* Are Emergent Abilities of Large Language Models a Mirage? [https://arxiv.org/abs/2304.15004]: The Schaeffer et al. paper the episode invokes directly. It argues that metric choice can manufacture apparent emergent abilities, setting up this episode's darker mirror claim that metric choice can also hide failures.
* Inverse Scaling Prize: Second Round Winners (McKenzie et al.) [https://arxiv.org/abs/2306.09479]: The original taxonomy of inverse-scaling failures the episode contrasts with, useful for understanding why this paper's forecasting failure is structurally different from earlier adversarial cases.
* Inverse Scaling Can Become U-Shaped [https://arxiv.org/abs/2211.02011]: Wei et al.'s follow-up showing many inverse-scaling tasks recover at frontier scale, the precise counterpoint to this episode's claim that the forecasting failure is monotonic all the way to the frontier.
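
The gap between threshold scoring and tail-integrating scoring described in the takeaways can be made concrete with a small sketch. Everything here is an assumption for illustration: the function names, the threshold, and the two sets of forecast samples are made up and do not come from the paper. The CRPS is estimated from samples with the standard ensemble formula E|X - y| - 0.5 E|X - X'|. Lower is better for both scores.

```python
from typing import Sequence


def brier_score(samples: Sequence[float], outcome: float, threshold: float) -> float:
    """Brier score for the binary event 'value exceeds threshold'."""
    prob = sum(s > threshold for s in samples) / len(samples)
    happened = 1.0 if outcome > threshold else 0.0
    return (prob - happened) ** 2


def crps(samples: Sequence[float], outcome: float) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    error = sum(abs(s - outcome) for s in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return error - 0.5 * spread


# Hypothetical numbers, chosen only to show the effect (not from the paper).
OUTCOME = 100.0      # what actually happened
THRESHOLD = 50.0     # the yes/no question: "will the value exceed 50?"
cautious = [40.0, 80.0, 100.0, 120.0, 140.0]
overcommitted = [90.0, 100.0, 110.0, 1_000.0, 10_000.0]  # extreme upper tail

for name, samples in (("cautious", cautious), ("overcommitted", overcommitted)):
    print(name, brier_score(samples, OUTCOME, THRESHOLD), crps(samples, OUTCOME))
```

In this toy setup the overcommitted forecast wins on the threshold question, because all of its samples land on the right side of 50, yet it loses badly on CRPS, because the score integrates over the whole distribution and the runaway upper tail is counted.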

Finding Millions of Readable Concepts Inside a Real, Deployed AI Model

Source: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet [https://arxiv.org/abs/2605.29358]
Paper was published on May 28, 2026.
This episode was AI-generated on May 29, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Researchers reached into Claude's internals, found the single thread that means 'Golden Gate Bridge,' and turned it up until the model believed it was the bridge. This episode unpacks the paper that proved interpretability works on a real commercial model, and is unusually honest about everything it still can't do.

KEY TAKEAWAYS
* Why individual neurons mean nothing, and how the 'superposition' idea (concepts as blended directions, like mixing paint) explains it
* How sparse autoencoders un-mix those directions into millions of human-readable features, and how scaling laws turned 'how big a dictionary' into an engineering decision
* The crucial difference between a feature that merely correlates with a concept (a thermometer) and one you can pull to change behavior (a thermostat)
* Why the reasoning that actually mattered in the Kobe Bryant trivia chain was the seventieth-loudest signal: loudness and importance turn out to be different things
* Why finding a 'deception' or 'bioweapon' feature is not an alarm bell, and what the authors say the real safety signal would be
* Where the paper is weakest: no ground truth, circular Claude-grades-Claude evaluation, off-distribution steering, cherry-picked reasoning chains, and dictionaries that miss most of what's there

* 00:00 - Golden Gate Claude and the question of where concepts live
  The opening demo sets up the central puzzle: what is a nameable 'thread' inside a pile of numbers, and why can't you just read it off the neurons?
* 03:05 - Superposition and dictionary learning
  The paint-mixing intuition for why concepts are directions rather than neurons, and how sparse autoencoders recover those directions by reconstructing the model's state from a tiny handful of features.
* 06:10 - From toy models to a real one
  Why scaling this to Claude 3 Sonnet (and deriving Chinchilla-style scaling laws to pick a 34-million-feature dictionary) was an existential test for the whole field.
* 09:15 - Are the features real? Abstraction and causation
  Features that fire across languages and even images, the 'bug in code' detector, and the thermometer-versus-thermostat distinction that the paper's credibility rests on.
* 12:20 - Watching the model reason: the Kobe Bryant chain
  How knocking out features one at a time revealed a causal hop from Kobe to Lakers to LA to California to Sacramento, and why the load-bearing features were buried deep in the noise.
* 14:05 - The periodic-table finding
  How concept frequency predicts when a concept gets its own feature, why a one-in-a-billion concept needs a billion-feature dictionary, and how features split as the microscope gets sharper.
* 18:30 - Safety-relevant features, carefully framed
  Deception, secrecy, hate, and self-concept features exist, but the authors argue the real question is when they fire, not that they exist, illustrated with honesty-lever and forced-screed demos.
* 19:55 - Where the paper is weakest
  The authors' own reservations: no ground truth, the circular Claude-grades-Claude evaluation, the sensitivity gap, extreme off-distribution steering, cherry-picked chains, and demonstrably incomplete dictionaries.
* 24:41 - What it actually settled
  The technique survived contact with a real model and made unsupervised, one-time-cost interpretability credible, while leaving the safety payoff an explicit aspiration rather than a result.

RECOMMENDED READING
* Toy Models of Superposition [https://arxiv.org/abs/2209.10652]: The earlier Anthropic work that introduced the superposition hypothesis the episode leans on (the paint-mixing intuition for why single neurons are polysemantic), but demonstrated it only on toy models, the setting this paper had to prove scalable.
* Towards Monosemanticity: Decomposing Language Models With Dictionary Learning [https://transformer-circuits.pub/2023/monosemantic-features/index.html]: The one-layer 'sandbox' study whose skeptical reception ('cute, but does it scale?') is the exact existential question this episode says the Sonnet paper was built to answer.
* Training Compute-Optimal Large Language Models (Chinchilla) [https://arxiv.org/abs/2203.15556]: The scaling-law paper the episode name-checks as the template for deciding how big the 34-million-feature dictionary should be, turning a gamble into a curve you can read off.
* Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (Othello-GPT) [https://arxiv.org/abs/2210.13382]: The Othello cautionary tale the hosts cite (researchers assumed the wrong board representation), illustrating why the episode prizes unsupervised dictionary learning over hand-built detectors.
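
The thermometer-versus-thermostat distinction from the takeaways can be sketched with a toy dictionary. This is a minimal illustration under loud assumptions: the weights are hand-picked rather than trained, the encoder is tied to the decoder, and the names `DECODER`, `encode`, `decode`, and `steer` are hypothetical. It is a generic sparse-autoencoder-style forward pass, not the paper's architecture or training setup. Reading a feature's activation is the thermometer; adding that feature's direction back into the activation is the thermostat.

```python
from typing import List

Vector = List[float]

# Hypothetical toy dictionary: 3 feature directions in a 2-D activation space
# (more features than dimensions, echoing the superposition idea).
DECODER: List[Vector] = [
    [1.0, 0.0],   # feature 0
    [0.0, 1.0],   # feature 1
    [-1.0, 0.0],  # feature 2
]
ENC_BIAS: Vector = [0.0, 0.0, 0.0]  # assumption: zero biases, tied encoder weights


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def encode(x: Vector) -> Vector:
    """Thermometer: read feature activations with a ReLU encoder."""
    return [max(0.0, dot(d, x) + b) for d, b in zip(DECODER, ENC_BIAS)]


def decode(features: Vector) -> Vector:
    """Reconstruct the activation as a weighted sum of feature directions."""
    out = [0.0] * len(DECODER[0])
    for activation, direction in zip(features, DECODER):
        for i, value in enumerate(direction):
            out[i] += activation * value
    return out


def steer(x: Vector, feature: int, strength: float) -> Vector:
    """Thermostat: push the activation along one feature's direction."""
    return [xi + strength * di for xi, di in zip(x, DECODER[feature])]


x = [2.0, 0.5]
print("features:", encode(x))            # only some features are active
print("reconstruction:", decode(encode(x)))
print("after steering feature 1:", encode(steer(x, 1, 10.0)))
```

In a real setting the dictionary would be learned by minimizing reconstruction error plus a sparsity penalty, and steering would be applied inside the model's forward pass rather than to a standalone vector.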
