Today in arXiv AI

Podcast by Scot Bearss

English

Technology and Science

About Today in arXiv AI

Today in arXiv AI is your daily deep dive into the cutting edge of artificial intelligence. Every morning, we unpack the latest breakthroughs in LLM architectures, agentic AI, multimodal models, scaling strategies, safety research and more, mixing expert analysis, lively debate, and real-world use cases. Whether you’re an AI practitioner, tech leader, or just curious about what’s next, we break down complex papers (and what they mean for you) into a fast-paced, two-host conversation you’ll actually enjoy. I am an independent creator and not affiliated with arXiv. Sources are linked in the episode descriptions.

All episodes

7 episodes

Cognition, Contracts, and Compression

Generated with Google NotebookLM.

Episode Description: In this episode, we explore 10 new papers advancing our understanding of how LLMs think, how agents can be trusted, and how systems can scale more efficiently:

* What LLMs really "know" – UCCT proposes a formal theory of cognition in LLMs, arguing that intelligence is emergent and context-triggered rather than intrinsic.
* Rethinking RAG – CoCoA and CoCoA-zero show how multi-agent collaboration improves synergy between internal model memory and retrieved context.
* Efficiency, by design – Efficient Agents sheds light on cost/performance trade-offs in agent systems, while Blueprint First separates logic from generation to enable deterministic workflows.
* Contrastive learning, upgraded – Context-Adaptive Multi-Prompt Embedding improves vision-language alignment with adaptive token prompts and diversity constraints.
* Inference-time teaming – CTTS scales up LLM performance via collective test-time scaling, using reward model ensembles and agent collaboration.
* At the edge – A new adaptive agent placement and migration framework uses LLMs and ant colony optimization to meet real-time edge constraints.
* Smarter chains of thought – A step entropy metric allows LLMs to prune redundant reasoning during inference, improving cost-efficiency without sacrificing accuracy.
* Quantization, vision-style – VLMQ brings post-training quantization to Vision-Language Models, optimizing for both modality balance and efficiency.
* Reliable by contract – A Design-by-Contract–inspired layer enables neurosymbolic agents to enforce input-output constraints, offering a formal basis for agent safety.

From the nature of LLM cognition to practical methods for verifiable, scalable deployment, this episode highlights where theory meets engineering and where structure enhances trust.

Sources:

* The Unified Cognitive Consciousness Theory for Language Models (UCCT) [https://arxiv.org/pdf/2506.02139] | HTML [https://arxiv.org/html/2506.02139v4]
* CoCoA: Collaborative Chain-of-Agents for Parametric-Retrieved Knowledge Synergy [https://arxiv.org/pdf/2508.01696] | HTML [https://arxiv.org/html/2508.01696v2]
* Efficient Agents: Building Effective Agents While Reducing Cost [https://arxiv.org/pdf/2508.02694] | HTML [https://arxiv.org/html/2508.02694v1]
* Blueprint First, Model Second: A Framework for Deterministic LLM Workflow [https://arxiv.org/pdf/2508.02721] | HTML [https://arxiv.org/html/2508.02721v1]
* Context-Adaptive Multi-Prompt LLM Embedding for Vision-Language Alignment [https://arxiv.org/pdf/2508.02762] | HTML [https://arxiv.org/html/2508.02762v1]
* CTTS: Collective Test-Time Scaling [https://arxiv.org/pdf/2508.03333] | HTML [https://arxiv.org/html/2508.03333v1]
* Adaptive AI Agent Placement and Migration in Edge Intelligence Systems [https://arxiv.org/pdf/2508.03345] | HTML [https://arxiv.org/html/2508.03345v1]
* Compressing Chain-of-Thought in LLMs via Step Entropy [https://arxiv.org/pdf/2508.03346] | HTML [https://arxiv.org/html/2508.03346v1]
* VLMQ: Efficient Post-Training Quantization for Vision-Language Models [https://arxiv.org/pdf/2508.03351] | HTML [https://arxiv.org/html/2508.03351v1]
* A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design [https://arxiv.org/pdf/2508.03665] | HTML [https://arxiv.org/html/2508.03665v1]

6 Aug 2025 - 26 min

Architectures, Attacks, and Autonomy

This episode dives into 15 new research papers pushing the boundaries of LLM architecture, safety, and real-world deployment:

* Training and architecture breakthroughs – Mix-LN introduces a hybrid layer-norm strategy that unlocks deeper layers; a new residual stream inspired by associative memory accelerates in-context learning; and meta-experience replay stabilizes continual pretraining with minimal overhead.
* Factuality and trust – A reinforcement learning framework with mechanistic interpretability improves factual consistency in reasoning chains, while AdaCoRe and SOP block restricted content dynamically, with no need for finetuning.
* Jailbreaks and watermarking – PUZZLED bypasses filters using crossword-like obfuscation, while FPEdit subtly fingerprints models by modifying sparse weights, remaining stealthy under distribution shifts.
* LLMs as debaters and judges – MArgE builds argument trees across multiple models to verify claims, outperforming single-LLM setups; Refine-n-Judge uses a single model to simulate both human refinement and scoring in preference learning pipelines.
* Autonomous agents in motion – UROSA deploys distributed LLMs on underwater robots with real-time cognition; L3M+P pairs lifelong planning with knowledge graphs for service robotics.
* RAG, revisited – Temporal GraphRAG tackles stale or redundant knowledge by modeling time-aware retrieval; CoCoA boosts multi-hop QA by harmonizing LLM memory and external context; Meta-RAG uses code summarization to navigate and debug large codebases.
* LLMs optimizing LLM infrastructure – CRINN reframes nearest-neighbor search as a reinforcement learning problem, showing that models can now help tune the very algorithms that serve them.

From fingerprints to federated learning, memory graphs to metaphorical puzzles, this episode maps out the frontier of how we build, protect, and operationalize language models.

Sources:

* https://doi.org/10.48550/arXiv.2412.13795
* https://doi.org/10.48550/arXiv.2412.15113
* https://doi.org/10.48550/arXiv.2507.22940
* https://doi.org/10.48550/arXiv.2507.23735
* https://doi.org/10.48550/arXiv.2508.01198
* https://doi.org/10.48550/arXiv.2508.01306
* https://doi.org/10.48550/arXiv.2508.01543
* https://doi.org/10.48550/arXiv.2508.01680
* https://doi.org/10.48550/arXiv.2508.01696
* https://doi.org/10.48550/arXiv.2508.01908
* https://doi.org/10.48550/arXiv.2508.01917
* https://doi.org/10.48550/arXiv.2508.02091
* https://doi.org/10.48550/arXiv.2508.02092
* https://doi.org/10.48550/arXiv.2508.02584
* https://doi.org/10.48550/arXiv.2508.02611

6 Aug 2025 - 43 min

Jailbreaks, Collaboration, and Cognitive Shifts

Generated by Google NotebookLM.

This episode explores 15 new research papers at the edge of LLM behavior, safety, collaboration, and reasoning:

* Beyond passive replies – CollabLLM rethinks how LLMs interact across turns, training them to uncover user intent and proactively collaborate.
* Red teaming, automated – RedCoder weaponizes multi-turn attacks against code models, training autonomous agents to probe for unsafe generations.
* Synthesis by simulation – CodeEvo builds training data by pairing coder and reviewer agents in feedback loops, automating high-quality instruction-code generation.
* Internal deception – Linear probes and SAEs reveal how truthful features flip when models are prompted to lie.
* Defense by deflection – SDeflection avoids refusal and instead rewrites malicious prompts into innocuous replies, lowering jailbreak success without hurting helpfulness.
* Attack by persona – A genetic algorithm crafts persona prompts that reduce refusal rates and supercharge jailbreaks, especially when stacked with other methods.
* Agents with evolving maps – CoEx lets planning agents continually revise their world models, co-adapting structure and strategy over time.
* Interfaces for oversight – Magentic-UI powers human-in-the-loop agentic systems with long-term memory, action guards, and collaborative controls.
* Measuring long-context reasoning – NeedleChain moves past “needle-in-a-haystack” with tasks that require full semantic integration across long input windows.
* Bias as an exploit – CognitiveAttack uncovers how stacking psychological biases in prompts dramatically increases LLM jailbreak success.
* Patching with logic – RePaCA guides LLMs to assess bug fixes using chain-of-thought, boosting accuracy and explainability in patch correctness tasks.
* Federated fine-tuning at scale – H2Tune handles architectural and task diversity across clients with a novel decomposition and disentanglement scheme.
* Multimodal mastery – MoCHA uses sparse MoE connectors and hierarchical attention to align vision with language and reduce hallucinations.
* Where demos belong – A detailed analysis of demo position bias finds that demonstration ordering in prompts drastically alters LLM accuracy and stability.

Together, these papers uncover the subtle mechanics that shape LLM trustworthiness, the strategies that make or break jailbreak defenses, and the design patterns emerging in agentic interfaces and federated learning.

Sources:

* CollabLLM: arXiv:2406.04425
* RedCoder: arXiv:2407.00482
* CodeEvo: arXiv:2407.00483
* When Truthful Representations Flip Under Deceptive Instructions: arXiv:2407.00495
* Strategic Deflection: arXiv:2407.00496
* Enhancing Jailbreak Attacks via Persona Prompts: arXiv:2407.00499
* CoEx: arXiv:2407.00508
* Magentic-UI: arXiv:2407.00510
* NeedleChain: arXiv:2407.00518
* CognitiveAttack: arXiv:2407.00519
* RePaCA: arXiv:2407.00523
* H2Tune: arXiv:2407.00529
* MoCHA: arXiv:2407.00530
* Where to show Demos in Your Prompt: arXiv:2407.00533

31 Jul 2025 - 1 h 2 min

Planning Agents, Emotional Bias, and Trustworthy Responses

Generated with Google NotebookLM.

This episode dives into 16 cutting-edge papers that reimagine how LLMs plan, adapt, reason, and stay safe doing it:

* Planning meets population play – STRATEGIST lets LLMs refine high-level strategies via text and execute them with Monte Carlo precision, rivaling humans in multi-turn games.
* Does tone steer truth? – A systematic study finds GPT-4 resists negative prompt bias (until it doesn’t), revealing tone-induced semantic drift and suppressed emotional alignment.
* Geometric insight – Curved Inference tracks how prompts bend the LLM’s residual stream, exposing layers of latent concern and meaning through salience and curvature.
* Smarter retrieval, lighter load – SemRAG blends semantic chunking with knowledge graphs to turbocharge domain-specific RAG without the finetuning tax.
* Visual agents that learn – VizGenie evolves itself through LLM-generated code and VQA, slashing overhead in scientific visualization tasks.
* Tech mapping on autopilot – RATE uses LLMs to extract and validate key tech terms from papers, building networks that outperform BERT-based extractors by 70% F1.
* Trust in high-stakes moments – Some models play it safe; others don’t. Sycophancy, clarifying questions, and activation vectors reveal how cautious AI can be shaped.
* Guardrails, reimagined – OneShield provides a plug-and-play compliance layer to tailor LLM behavior across privacy, ethics, and safety.
* Built-in sabotage defense – SDD defangs malicious fine-tuning by teaching models to answer harmful prompts with elegant irrelevance.
* Wireless compositionality – ContextLoRA and ContextGear let one LLM handle multiple multimodal mobile tasks efficiently, backed by task graphs and fine-tuned adaptation.
* Measuring uncertainty, properly – A Shapley-based metric replaces naive entropy to better predict when LLMs are bluffing.
* Structure for thinking agents – Graph-Augmented LLM Agents use graphs for better planning, tool use, memory, and MAS coordination.
* Due diligence done right – A rigorous RAG evaluation protocol blends human and LLM judgment for statistical reliability, well suited to finance and healthcare use cases.
* RL, no humans required – RLSF lets models learn from their own confidence levels, improving calibration and reasoning without labels or gold data.
* LLMs that plan on phones – MapAgent builds page memory from task traces to navigate mobile UIs with fine-grained, trajectory-aware precision.

These papers showcase a new class of agents: introspective, modular, cautious, and capable of evolving workflows across scientific, mobile, and safety-critical contexts.

Sources:

* https://doi.org/10.48550/arXiv.2408.10635
* https://doi.org/10.48550/arXiv.2507.21083
* https://doi.org/10.48550/arXiv.2507.21107
* https://doi.org/10.48550/arXiv.2507.21110
* https://doi.org/10.48550/arXiv.2507.21124
* https://doi.org/10.48550/arXiv.2507.21125
* https://doi.org/10.48550/arXiv.2507.21132
* https://doi.org/10.48550/arXiv.2507.21170
* https://doi.org/10.48550/arXiv.2507.21182
* https://doi.org/10.48550/arXiv.2507.21199
* https://doi.org/10.48550/arXiv.2507.21406
* https://doi.org/10.48550/arXiv.2507.21407
* https://doi.org/10.48550/arXiv.2507.21753
* https://doi.org/10.48550/arXiv.2507.21931
* https://doi.org/10.48550/arXiv.2507.21953

30 Jul 2025 - 1 h 16 min

Factuality, Alignment, and Edge Efficiency

Generated with Google NotebookLM.

This week’s roundup distills 15 brand-new arXiv papers that are bending the curve on large-language-model accuracy, efficiency, and safety:

* Truth under pressure – A RAG-powered adversarial pipeline shreds GPT-4o’s fact-checker, proving that evaluators need retrieval too.
* API docs, minus the bloat – Smart chunking plus a “Discovery Agent” trims OpenAPI specs while raising endpoint recall.
* Alignment, re-weighted – FocalPO boosts Direct Preference Optimisation by doubling down on pairs the model already ranks right.
* Seeing, thinking, scheming – MultiMind merges facial cues, vocal tone, Theory-of-Mind, and MCTS to out-bluff humans in Werewolf.
* Token thrift as design law – A manifesto argues that pruning isn’t just for speed; it cuts hallucinations and stabilises training.
* Cheaper RL finetunes – MoPPS predicts prompt difficulty on-the-fly and slashes rollout counts.
* Edge-ready inference – DeltaLLM exploits temporal sparsity, while HCAttention squeezes KV cache to 25%, letting Llama-3-8B read 4M tokens on a single A100.
* LLMs that draw – A ReAct + RAG agent converts natural-language briefs straight into AutoCAD code.
* Tool orchestration at scale – SciToolAgent uses a knowledge-graph spine to automate hundreds of domain-specific apps.
* Where models get lost – MazeEval exposes huge language-bound gaps in spatial navigation.
* Red-team reality check – 1.8M attacks show nearly every frontier agent breaks policy within 100 prompts; robustness ≠ size.
* Proving corrigibility – Five lexicographic “core safety values” deliver the first provable obedience guarantees.
* Open-source powerhouse – Kimi K2 (a 1T-parameter MoE with 32B active parameters) tops agentic leaderboards with a new MuonClip optimiser.

From adversarial fact-checking to provably safe utility heads, these papers reveal the state of the art and the cracks that still need sealing.

Tune in for a 30-minute tour of:

* efficiency tricks that make billion-param models mobile-friendly,
* alignment methods that actually move preferences,
* benchmarks that stress-test reasoning across space, language, and social strategy, and
* frameworks that weld LLMs to real-world tools without burning GPU budgets.

If you build with, bet on, or just geek out over LLMs, this episode will arm you with the freshest insights and plenty of rabbit holes for the weekend.

Sources:

* https://arxiv.org/pdf/2410.14651
* https://arxiv.org/pdf/2411.19804
* https://arxiv.org/pdf/2501.06645
* https://arxiv.org/pdf/2504.18039
* https://arxiv.org/pdf/2505.18227
* https://arxiv.org/pdf/2507.04632
* https://arxiv.org/pdf/2507.19608
* https://arxiv.org/pdf/2507.19771
* https://arxiv.org/pdf/2507.19823
* https://arxiv.org/pdf/2507.20280
* https://arxiv.org/pdf/2507.20395
* https://arxiv.org/pdf/2507.20526
* https://arxiv.org/pdf/2507.20534
* https://arxiv.org/pdf/2507.20796
* https://arxiv.org/pdf/2507.20964

30 Jul 2025 - 49 min
