GRPO is Secretly a Process Reward Model

20 min · Yesterday

Description

This paper establishes that Group Relative Policy Optimization (GRPO), while appearing to use only final outcome rewards, inherently functions as a Process Reward Model (PRM) through its implicit sub-trajectory credit assignment. By analyzing groups of trajectories that share identical prefixes, the authors prove that GRPO naturally computes step-level rewards using a Monte Carlo approach. However, this hidden structure reveals a flaw: imbalanced step frequencies can skew advantages, inadvertently suppressing high-reward paths and hindering efficient model training. To fix this, the researchers introduce $\lambda$-GRPO, a modified objective that scales token-level losses to neutralize these frequency imbalances. Empirical testing shows that $\lambda$-GRPO enables Large Language Models to reach superior reasoning performance significantly faster than standard GRPO. Ultimately, the work demonstrates that the built-in PRM structure of GRPO can be optimized to boost efficiency without the need for expensive, manual step-level annotations.
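The idea of step-level rewards emerging from shared prefixes can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy trajectories, and the use of the population standard deviation are all assumptions made for illustration. The first function computes the usual group-relative advantage from outcome rewards; the second shows how averaging outcome rewards over all trajectories that share a prefix yields a Monte Carlo value estimate for that prefix.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Normalize each outcome reward by the group's mean and std.

    Assumption: population std is used; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rewards equal: no trajectory is better than the group average.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def prefix_value_estimates(trajectories, rewards):
    """Monte Carlo estimate per prefix: mean outcome reward of every
    trajectory in the group that begins with that prefix."""
    buckets = defaultdict(list)
    for steps, r in zip(trajectories, rewards):
        for k in range(1, len(steps) + 1):
            buckets[tuple(steps[:k])].append(r)
    return {prefix: mean(rs) for prefix, rs in buckets.items()}


# Hypothetical group of four sampled trajectories (steps are just labels).
trajectories = [["a", "b"], ["a", "c"], ["d", "e"], ["d", "f"]]
rewards = [1.0, 0.0, 0.0, 0.0]

advantages = group_relative_advantages(rewards)
values = prefix_value_estimates(trajectories, rewards)
```

In this toy group, the prefix `("a",)` is shared by one successful and one failed trajectory, so its estimated value sits between the two outcomes, while `("d",)` leads only to failures. How often a step appears in the group affects how much weight it receives, which is the kind of frequency imbalance the description says $\lambda$-GRPO is designed to neutralize.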

All episodes

765 episodes

GRPO is Secretly a Process Reward Model

Yesterday · 20 min

Agentic Interactions

This paper explores how AI agents inherit and potentially amplify human heterogeneity when tasked with negotiating on behalf of individuals. By comparing agentic interactions to a human-to-human benchmark, the study reveals that instructional prompts act as carriers for the principal's personality, biases, and demographic traits. Remarkably, delegating decisions to machines leads to a greater dispersion of outcomes and a breakdown of traditional fairness norms, such as the 50/50 split. The authors introduce the concept of "machine fluency"—the unique skill of effectively aligning an AI's behavior with one’s own goals—as a new source of economic inequality. These findings suggest that the agentic economy will not be a standardized marketplace, but rather one shaped by specification hazards and the latent characteristics of the humans who design the agents. Ultimately, the transition to AI mediation appears to transform and intensify existing social disparities rather than eliminating them.

Yesterday · 19 min

A Unifying View of Attention Sinks: Two Algorithms, Two Solutions

This research investigates the nature of attention sinks, which are specific tokens in Transformer models that attract disproportionate attention. The authors reveal that visually identical sink patterns can implement two distinct computational algorithms: Adaptive NOP and Broadcast. In the Adaptive NOP mechanism, the model uses a "null" token with near-zero value to suppress updates to the residual stream, essentially performing a "no-op" instruction. Conversely, the Broadcast mechanism uses a sink as a communication hub to aggregate and redistribute global information across the entire sequence. By applying specialized diagnostics to vision transformers (ViTs), the study shows that both mechanisms coexist and often transition from the [CLS] token to specific patch tokens in deeper layers. Finally, the authors demonstrate that combining gated attention with register tokens effectively mitigates these artifacts, leading to significantly improved performance in dense spatial tasks.

June 16, 2026 · 22 min

From AGI to ASI

This report from Google DeepMind explores the hypothetical transition from Artificial General Intelligence (AGI), which matches human capability, to Artificial Superintelligence (ASI), which far exceeds it. The authors outline four primary technological pathways to achieve this: quantitative scaling, algorithmic paradigm shifts, recursive self-improvement, and multi-agent coordination. While current growth in effective compute suggests rapid progress, the report identifies significant frictions such as the "data wall," economic resource limits, and the "abstraction barrier" that may bound machine intelligence. The report also provides a formal grounding for superintelligence through the Universal AI framework and the Legg-Hutter measure of intelligence. Ultimately, the authors argue that predicting the post-AGI future requires a massive interdisciplinary research effort to navigate high levels of uncertainty. This overview emphasizes that while ASI is not omnipotent, its digital advantages, such as substrate independence and high-bandwidth sharing, could fundamentally reshape human society.

June 14, 2026 · 23 min

Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

This research explores whether pairwise comparisons used to rank generative models actually reflect ground-truth accuracy. By converting multiple benchmarks into free-form formats, the authors found that Elo-style rankings achieve a remarkably high correlation with objective correctness. Surprisingly, this alignment remains strong even when the judge model is weaker than the candidates it evaluates, outperforming direct grading methods. While critics often worry about judge biases or stylistic cues, the study demonstrates that these factors have a minimal impact on the final model hierarchy. Furthermore, the paper identifies "echo"—or repetitive output—as a key reason why judges prefer one answer over another when both are technically correct. Ultimately, the results suggest that relative preferences are a robust and reliable proxy for absolute accuracy in competitive model evaluation.

June 13, 2026 · 19 min
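For readers unfamiliar with how pairwise judgments become a ranking, as in the "Correct Looks Better" episode above, here is a minimal sketch of a standard Elo update. It is not the paper's procedure: the K-factor of 32, the initial rating of 1000, and the function and model names are assumptions chosen for illustration.

```python
def expected_score(r_a, r_b):
    """Elo's predicted probability that A beats B (logistic curve, base 10)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def rank_models(comparisons, initial=1000.0, k=32.0):
    """Process (winner, loser) pairs in order and return models sorted
    from highest to lowest rating."""
    ratings = {}
    for winner, loser in comparisons:
        r_w = ratings.get(winner, initial)
        r_l = ratings.get(loser, initial)
        ratings[winner], ratings[loser] = elo_update(r_w, r_l, 1.0, k)
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical judge preferences between three models.
ranking = rank_models([("A", "B"), ("A", "C"), ("B", "C")])
```

Sequential Elo depends on the order in which comparisons are processed, so evaluation pipelines often shuffle and average over several passes or fit a Bradley-Terry model instead. This sketch shows only the basic update.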
