AI Papers: A Deep Dive
WHEN SMARTER MODELS FORECAST WORSE: THE HIDDEN FAILURE MODE IN LLM PREDICTIONS

Source: Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most [https://arxiv.org/abs/2605.22672]
Paper was published on May 21, 2026.

This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Claude Opus 4.6 looked at Brazil's 1986 hyperinflation, correctly named the regime-change risk, then forecast a number seven million times too high. A new paper shows this isn't a fluke: it's a structural pattern across epidemics, housing bubbles, and decades of measles data, and the standard way the field grades LLM forecasts can't see it.

KEY TAKEAWAYS
* Why the same model outputs can earn opposite verdicts (capable models look best under Brier-style scoring and worst under CRPS), and what that means for every existing LLM forecasting benchmark
* The specific trigger for the inversion: superlinear growth followed by a regime change, confirmed by a clean linear-growth control where the effect vanishes entirely
* A within-family Llama experiment showing scale and post-training each independently make the overcommitment worse, and compound when combined
* The unselected pre-vaccine US measles cohort (1,339 state-seasons) that rules out the 'you cherry-picked the crashes' objection, plus flu as a pre-registered negative control
* Why naming the historical episode rescues calibration for COVID and housing but fails completely for hyperinflation: the knowledge is in the model, but it doesn't reach the tails
* The one-line fix: report a tail-integrating proper scoring rule alongside threshold metrics, using forecasts benchmarks have already collected

* 00:00 - The Opus 4.6 hyperinflation moment
  A frontier model articulates the regime-change possibility, weighs it on the page, and then commits to extrapolation anyway, overshooting reality by a factor of seven million.
* 03:47 - Two ways to grade a forecast
  Distributional forecasts, threshold-based Brier scoring versus tail-integrating CRPS, and the weather-map analogy that makes the asymmetry click.
* 07:35 - The Freeciv benchmark and the first crack
  A clean, unseen forecasting setup where binary and continuous versions of the same question yield opposite capability-accuracy correlations at long horizons.
* 11:22 - The synthetic epidemic and its linear-growth control
  An exponential-then-crash simulator reproduces the inversion, and swapping in linear growth makes it vanish, pinning the mechanism to the bend-then-break shape.
* 15:10 - Competence-driven overcommitment
  Per-quantile decomposition shows the lower tail stays flat while the upper tail balloons with capability, and the within-family Llama 2x2 confirms scale and post-training each contribute.
* 18:57 - Real-world replications and the measles test
  COVID, housing, and hyperinflation replicate the pattern, but the unselected measles cohort and a pre-registered flu negative control are what make the result hard to dismiss.
* 22:45 - The verdict flip and what the model knows
  The same forecasts graded two ways reverse the sign of the capability correlation, and a knowledge probe reveals models can name the crisis they're forecasting yet still produce extreme tail overshoots.
* 26:33 - Limitations, the steelman, and the fix
  Honest pushback on the capability axis, the bundled post-training treatment, and the small hyperinflation sample, followed by the embarrassingly simple methodological recommendation.

RECOMMENDED READING
* Are Emergent Abilities of Large Language Models a Mirage? [https://arxiv.org/abs/2304.15004]: The Schaeffer et al. paper the episode invokes directly. It argues that metric choice can manufacture apparent emergent abilities, setting up this episode's darker mirror claim that metric choice can also hide failures.
* Inverse Scaling: When Bigger Isn't Better (McKenzie et al.) [https://arxiv.org/abs/2306.09479]: The original taxonomy of inverse-scaling failures the episode contrasts with, useful for understanding why this paper's forecasting failure is structurally different from earlier adversarial cases.
* Inverse Scaling Can Become U-Shaped [https://arxiv.org/abs/2211.02011]: Wei et al.'s follow-up showing many inverse-scaling tasks recover at frontier scale, the precise counterpoint to this episode's claim that the forecasting failure is monotonic all the way to the frontier.
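
For listeners who want to see the grading asymmetry from the "Two ways to grade a forecast" chapter in miniature, here is a rough Python sketch. It is an illustration only, not the paper's scoring code: the function names, the sample values, and the threshold are all made up for this example, and the sample-based CRPS estimator is an assumption about how one might score an ensemble of forecast draws. It shows two forecasts that put the same probability above a threshold, so they earn the same Brier score, while one of them carries an enormous upper tail that only CRPS penalizes.

```python
from itertools import product


def brier_score(samples, threshold, outcome):
    """Brier score for the binary event 'value exceeds threshold'."""
    # Forecast probability = share of forecast draws above the threshold.
    p = sum(s > threshold for s in samples) / len(samples)
    occurred = 1.0 if outcome > threshold else 0.0
    return (p - occurred) ** 2


def crps_from_samples(samples, outcome):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    mean_abs_error = sum(abs(s - outcome) for s in samples) / n
    mean_spread = sum(abs(a - b) for a, b in product(samples, repeat=2)) / (n * n)
    return mean_abs_error - 0.5 * mean_spread


# Hypothetical numbers, chosen only to make the contrast visible.
outcome = 100.0      # what actually happened after the regime change
threshold = 150.0    # the binary question: "will the value exceed 150?"

cautious = [50.0, 80.0, 100.0, 120.0, 200.0]
overcommitted = [50.0, 80.0, 100.0, 120.0, 2_000_000.0]  # same shape, huge upper tail

brier_a = brier_score(cautious, threshold, outcome)
brier_b = brier_score(overcommitted, threshold, outcome)
crps_a = crps_from_samples(cautious, outcome)
crps_b = crps_from_samples(overcommitted, outcome)

print("Brier:", brier_a, brier_b)  # identical: the threshold hides the tail
print("CRPS: ", crps_a, crps_b)    # very different: the tail is integrated
```

Both forecasts place one draw in five above the threshold, so a threshold-based metric cannot tell them apart. CRPS integrates over the whole distribution, which is why the second forecast scores far worse. Lower is better for both scores.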