Critical Batch Size for LLM Policy Optimization

18 min · Jun 11, 2026

Description

This paper investigates the critical batch size (CBS) for Large Language Model (LLM) policy optimization, specifically focusing on the GRPO algorithm. The researchers break down gradient noise into inter-prompt and intra-prompt components to determine the point where increasing data parallelism yields diminishing returns. Their findings reveal that on-policy training is primarily limited by noise within individual prompts, meaning the total rollout count is the most important factor for efficiency. In contrast, off-policy rollout reuse significantly expands the critical batch size, allowing for much greater computational parallelism. By modeling how policy drift inflates gradient noise, the study provides a theoretical and empirical framework for optimizing training efficiency in verifiable reinforcement learning. These results offer practical guidance for allocating hardware resources during the post-training phase of model development.
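The link between intra-prompt noise and total rollout count can be illustrated with a toy two-level noise model. This is a minimal sketch under stated assumptions, not the paper's derivation: it treats the gradient as a scalar, assumes each prompt adds an independent offset with variance `sigma_inter2`, and assumes each rollout adds independent noise with variance `sigma_intra2`. All function and parameter names are hypothetical.

```python
import random
import statistics


def batch_mean_variance(sigma_inter2, sigma_intra2, num_prompts, rollouts_per_prompt):
    """Analytic variance of the batch-mean estimate under the assumed toy model.

    The inter-prompt term shrinks with the number of prompts only; the
    intra-prompt term shrinks with the total rollout count
    (num_prompts * rollouts_per_prompt).
    """
    inter = sigma_inter2 / num_prompts
    intra = sigma_intra2 / (num_prompts * rollouts_per_prompt)
    return inter + intra


def simulated_variance(sigma_inter2, sigma_intra2, num_prompts, rollouts_per_prompt,
                       trials=5000, seed=0):
    """Monte Carlo estimate of the same variance, as a sanity check."""
    rng = random.Random(seed)
    batch_means = []
    for _ in range(trials):
        total = 0.0
        for _ in range(num_prompts):
            # One shared offset per prompt (inter-prompt noise).
            offset = rng.gauss(0.0, sigma_inter2 ** 0.5)
            for _ in range(rollouts_per_prompt):
                # Independent noise per rollout (intra-prompt noise).
                total += offset + rng.gauss(0.0, sigma_intra2 ** 0.5)
        batch_means.append(total / (num_prompts * rollouts_per_prompt))
    return statistics.pvariance(batch_means)


if __name__ == "__main__":
    # With no inter-prompt noise, only the total rollout count (128) matters.
    print(batch_mean_variance(0.0, 4.0, 8, 16))
    print(batch_mean_variance(0.0, 4.0, 32, 4))
    # Compare the analytic value against simulation for a mixed case.
    print(batch_mean_variance(0.01, 1.0, 4, 8), simulated_variance(0.01, 1.0, 4, 8))
```

In this toy model, when the intra-prompt term dominates, splitting a fixed rollout budget across more or fewer prompts leaves the variance essentially unchanged, which matches the description's claim that total rollout count is what matters on-policy. Off-policy reuse and policy drift are not modeled here.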


All episodes

764 episodes

Agentic Interactions

This paper explores how AI agents inherit and potentially amplify human heterogeneity when tasked with negotiating on behalf of individuals. By comparing agentic interactions to a human-to-human benchmark, the study reveals that instructional prompts act as carriers for the principal's personality, biases, and demographic traits. Remarkably, delegating decisions to machines leads to a greater dispersion of outcomes and a breakdown of traditional fairness norms, such as the 50/50 split. The authors introduce the concept of "machine fluency"—the unique skill of effectively aligning an AI's behavior with one’s own goals—as a new source of economic inequality. These findings suggest that the agentic economy will not be a standardized marketplace, but rather one shaped by specification hazards and the latent characteristics of the humans who design the agents. Ultimately, the transition to AI mediation appears to transform and intensify existing social disparities rather than eliminating them.

Jun 17, 2026 · 19 min

A Unifying View of Attention Sinks: Two Algorithms, Two Solutions

This research investigates the nature of attention sinks, which are specific tokens in Transformer models that attract disproportionate attention. The authors show that the same visual attention pattern can implement two distinct computational algorithms: Adaptive NOP and Broadcast. In the Adaptive NOP mechanism, the model uses a "null" token with near-zero value to suppress updates to the residual stream, essentially performing a "no-op" instruction. Conversely, the Broadcast mechanism uses a sink as a communication hub to aggregate and redistribute global information across the entire sequence. By applying specialized diagnostics to vision transformers (ViTs), the study shows that both mechanisms coexist and that the sink often moves from the [CLS] token to specific patch tokens in deeper layers. Finally, the authors demonstrate that combining gated attention with register tokens effectively mitigates these artifacts, leading to significantly improved performance in dense spatial tasks.

Yesterday · 22 min

From AGI to ASI

This report from Google DeepMind explores the hypothetical transition from Artificial General Intelligence (AGI), which matches human capability, to Artificial Superintelligence (ASI), which far exceeds it. The authors outline four primary technological pathways to achieve this: quantitative scaling, algorithmic paradigm shifts, recursive self-improvement, and multi-agent coordination. While current growth in effective compute suggests rapid progress, the report identifies significant frictions such as the "data wall," economic resource limits, and the "abstraction barrier" that may bound machine intelligence. The report also provides a formal grounding for superintelligence through the Universal AI framework and the Legg-Hutter measure of intelligence. Ultimately, the authors argue that predicting the post-AGI future requires a massive interdisciplinary research effort to navigate high levels of uncertainty. The report emphasizes that while ASI is not omnipotent, its digital advantages, such as substrate independence and high-bandwidth sharing, could fundamentally reshape human society.

Jun 14, 2026 · 23 min

Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

This research explores whether pairwise comparisons used to rank generative models actually reflect ground-truth accuracy. By converting multiple benchmarks into free-form formats, the authors found that Elo-style rankings achieve a remarkably high correlation with objective correctness. Surprisingly, this alignment remains strong even when the judge model is weaker than the candidates it evaluates, outperforming direct grading methods. While critics often worry about judge biases or stylistic cues, the study demonstrates that these factors have a minimal impact on the final model hierarchy. Furthermore, the paper identifies "echo"—or repetitive output—as a key reason why judges prefer one answer over another when both are technically correct. Ultimately, the results suggest that relative preferences are a robust and reliable proxy for absolute accuracy in competitive model evaluation.

Jun 13, 2026 · 19 min

Critical Batch Size for LLM Policy Optimization

Jun 11, 2026 · 18 min
