How a 7B Model Out-Investigates a 72B One by Choosing What to Look At

Beschrijving

HOW A 7B MODEL OUT-INVESTIGATES A 72B ONE BY CHOOSING WHAT TO LOOK AT Source: Native Active Perception as Reasoning for Omni-Modal Understanding [https://arxiv.org/abs/2606.19341] Paper was published on June 17, 2026 This episode was AI-generated on June 18, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs. A seven-billion-parameter model beats one ten times its size on long videos while looking at seventy-three percent fewer frames — by treating the act of looking as a reasoning step instead of a fixed cost. The trick: the model takes notes in plain text, purges the raw pixels, and spends effort in proportion to how hard the question is, not how long the footage runs. We dig into why that breaks the old cost curve, and where the paper's clever entropy machinery does and doesn't earn its billing. KEY TAKEAWAYS * Why the standard 'pour every frame into the model' approach makes a trivial question about a three-hour film cost as much as the hardest one * How forcing the model to write text notes and discard raw frames keeps compute cost flat as videos grow four times longer * The temporal-grounding result where the agent jumped 33 points absolute and beat GPT-4o and Gemini-2.5-Pro at finding exact moments * How entropy is used as a 'stress meter' to send training credit to the pivotal decision steps rather than smearing it across routine ones * Why the hosts argue the entropy credit-assignment fix is a refinement worth a point or less — the architecture, not the RL trick, is doing the heavy lifting * The open question the paper doesn't answer: the RL was only trained on sub-five-minute clips, yet every headline claim is about hour-plus footage * 00:00 — The brute-force wall in video AI Why dumping every frame into a model makes answer cost scale with video length instead of question difficulty, and hits a memory wall on long footage. * 02:02 — Looking as a reasoning step The core move — a single model that decides what to look at, interprets it, and answers, running in a detective-style loop that purges raw pixels and keeps only text notes. * 05:09 — Proving the cost curve stays flat The cleanest result in the paper: as videos grow four times longer the agent does roughly the same work, plus the honest caveat that timestamp metadata is doing quiet work. * 07:43 — Temporal grounding and the speed surprise A 33-point jump on finding exact moments, beating much larger models, while running faster and on a quarter of the hardware. * 10:18 — Training the investigator: imitation first Why you can't just hand a fresh model a reward signal, how teacher trajectories are filtered for both correct answers and justified reasoning, and why deliberately keeping mistakes in matters. * 12:52 — The entropy credit-assignment idea Using the model's own uncertainty as a stress meter to amplify credit on bold-and-right moments and penalize confused-and-wrong ones, illustrated by the Coca-Cola/American Express trace. * 15:27 — Pressure-testing the claims The hosts argue the entropy fix buys far less than the narrative suggests, the RL was never trained at the long durations being headlined, and the pivotal-step metric is a proxy validated by another proxy. * 18:02 — From thinking harder to looking smarter Test-time scaling shows more deliberation helps but the agent still stops when confident, landing the paper's real thesis: for long video the bottleneck is perceptual incompleteness, not reasoning depth. RECOMMENDED READING * Video-STaR / Visual Programming approaches aside — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [https://arxiv.org/abs/2501.12948] — The episode's RL act builds on the GRPO-style 'one reward broadcast to the whole trajectory' approach this paper popularized — useful for understanding the 'advantage homogenization' flaw the episode critiques. * ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629] — The canonical formulation of the reason–act–observe loop that this episode's 'looking as a reasoning step' agent extends to video perception. * Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning [https://arxiv.org/abs/2506.01939] — Directly relevant to the episode's central claim that high-entropy moments mark the pivotal decision points worth amplifying during RL credit assignment.

Training a Model to Mean What It Says, And Why That Isn't the Same as Being Good

TRAINING A MODEL TO MEAN WHAT IT SAYS, AND WHY THAT ISN'T THE SAME AS BEING GOOD Source: Self-CTRL: Self-Consistency Training with Reinforcement Learning [https://arxiv.org/abs/2606.18327] Paper was published on June 16, 2026 This episode was AI-generated on June 18, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs. For a decade, nobody trusted an AI's account of itself enough to use it for auditing. A new MIT paper tries to train that self-knowledge into existence — and gets a model's stated rules from coin-flip-predictive to 92% predictive of its actual behavior. But there's a catch the authors are unusually honest about: a model can become perfectly consistent by quietly lowering its own standards, and the optimizer often prefers exactly that. KEY TAKEAWAYS * Why standard language model training never rewards self-consistency — the model is scored on each answer in isolation, so its stated principles and its actual behavior are never dragged into the same room * The two ways to close the words-deeds gap: 'explanation training' (rewrite the self-description to match behavior, for transparency) versus 'behavior training' (change behavior to honor the description, for alignment) — and why a balanced blend beats either extreme * The clean coin-flip proof: with no ground-truth labels, the model recovers nearly the same self-knowledge (R-squared ~0.66) as an oracle that was handed the answer key * How an eight-juror panel of clashing ethical frameworks functions not as moral balance but as a vagueness detector that punishes vacuous, predict-nothing policies * The uncomfortable failure case: on a discriminatory-CV request, explanation training makes the model honest about behaving badly by narrowing its stated rule — achieving 'consistency' without making the model better * Where the method breaks: it barely works on the permissive Qwen model (no contested refusal boundary to test against), the evaluation is graded almost entirely by other models, and a chunk of the safety gain matches existing self-judgment methods * 00:00 — The gap between what a model says and what it does Why the field distrusts a model's self-description, illustrated by Llama stating an anti-discrimination principle and then violating it one breath later. * 03:14 — The diagnosis: self-consistency was never on the test How standard training scores responses in isolation, and the proposed fix of rewarding cross-context agreement between a meta-level explanation and object-level behavior. * 06:29 — Predictable, not virtuous, and the two doors to consistency Why the objective rewards explanations that predict behavior rather than wise ones, and the choice between transparency-style and alignment-style training along a single knob. * 09:44 — The coin sandbox: recovering self-knowledge without labels A checkable toy experiment where the model learns to state its own hidden coin biases purely by checking against its own flips, nearly matching an oracle that cheated. * 12:59 — Moving to fuzzy rules: the jury as a vagueness detector Applying the method to constitutional AI with an eight-framework juror panel, and how juror disagreement exposes vacuous policies and prevents collapse to a trivial fixed point. * 16:14 — Does it work? The auditor test and the safety numbers A third-party model predicting behavior from stated rules jumps from 36% to 92%, attack success drops thirty-fold, with a real but modest cost in over-refusal. * 19:29 — The tension the paper doesn't close The discriminatory-CV case where explanation training achieves consistency by narrowing the rule rather than fixing the behavior, and why predictable isn't the same as trustworthy. * 22:44 — Limitations, circularity, and the Qwen failure The risks of model-graded evaluation, the method's collapse on a permissive base model, the overlap with existing self-judgment RL, and why its low cost still makes it worth taking seriously. RECOMMENDED READING * Constitutional AI: Harmlessness from AI Feedback [https://arxiv.org/abs/2212.08073] — The constitutional-AI recipe this episode builds on and critiques — the 'model grades itself against written principles' baseline that nearly matches Self-CTRL's safety gains. * Language Models (Mostly) Know What They Know [https://arxiv.org/abs/2207.05221] — Directly relevant to the episode's core claim that self-knowledge is latent and recoverable — it probes whether models can accurately predict their own correctness, the same gap Self-CTRL trains shut. * Measuring Faithfulness in Chain-of-Thought Reasoning [https://arxiv.org/abs/2307.13702] — Sharpens the episode's distinction between a model's stated account and its actual behavior, examining when self-explanations genuinely predict outputs versus serving as post-hoc rationalization. * Discovering Language Model Behaviors with Model-Written Evaluations [https://arxiv.org/abs/2212.09251] — Speaks to the episode's worry about LM-graded evaluation circularity, showing both the power and the shared-blind-spot risks of using models to probe and judge other models.

19 jun 202625 min

How a 7B Model Out-Investigates a 72B One by Choosing What to Look At

Beschrijving

Reacties

Probeer 14 dagen gratis

Alle afleveringen