AI Post Transformers
This episode explores a paper on inference-time scaling for coding agents, asking whether extra test-time compute still helps when tasks are long, messy, and require multi-step tool use rather than a single code completion. It focuses on the paper's main argument: the real bottleneck is not generating more rollout attempts, but representing prior attempts well enough to compare, select, and reuse them, with structured trajectory summaries serving as the key middle layer between raw transcripts and final patches. The discussion examines two mechanisms: a parallel "tournament" style selection method over summaries, and a sequential refinement method that conditions later attempts on distilled lessons from earlier ones. The conversation connects agent performance gains to practical questions of context management and selection versus reuse, and asks whether the reported improvements reflect a deep scaling insight or simply better engineering around long-horizon coding workflows.

Sources:
1. Scaling Test-Time Compute for Agentic Coding. Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, Anirudh Goyal, 2026. http://arxiv.org/abs/2604.16529
2. ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao, 2023. https://scholar.google.com/scholar?q=ReAct:+Synergizing+Reasoning+and+Acting+in+Language+Models
3. Reflexion: Language Agents with Verbal Reinforcement Learning. Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao, 2023. https://scholar.google.com/scholar?q=Reflexion:+Language+Agents+with+Verbal+Reinforcement+Learning
4. ExpeL: LLM Agents Are Experiential Learners. Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang, 2023. https://scholar.google.com/scholar?q=ExpeL:+LLM+Agents+Are+Experiential+Learners
5. Rethinking Thinking Tokens: LLMs as Improvement Operators. Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal, 2025. https://scholar.google.com/scholar?q=Rethinking+Thinking+Tokens:+LLMs+as+Improvement+Operators
6. CodeMonkeys: Scaling Test-Time Compute for Software Engineering. Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Re, Azalia Mirhoseini, 2025. https://scholar.google.com/scholar?q=CodeMonkeys:+Scaling+Test-Time+Compute+for+Software+Engineering
7. S*: Test Time Scaling for Code Generation. Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, 2025. https://scholar.google.com/scholar?q=S*:+Test+Time+Scaling+for+Code+Generation
8. Scaling Test-time Compute for LLM Agents. King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, Changwang Zhang, Chenghua Lin, Jun Wang, Ge Zhang, Wangchunshu Zhou, 2025. https://scholar.google.com/scholar?q=Scaling+Test-time+Compute+for+LLM+Agents
9. Agentic Test-Time Scaling for WebAgents. Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 2026. https://scholar.google.com/scholar?q=Agentic+Test-Time+Scaling+for+WebAgents
10. Does SWE-Bench-Verified Test Agent Ability or Model Memory? Thanosan Prathifkumar, Noble Saji Mathews, Meiyappan Nagappan, 2025. https://scholar.google.com/scholar?q=Does+SWE-Bench-Verified+Test+Agent+Ability+or+Model+Memory?
11. A Benchmark for Procedural Memory Retrieval in Language Agents. Ishant Kohar, Aswanth Krishnan, 2025. https://scholar.google.com/scholar?q=A+Benchmark+for+Procedural+Memory+Retrieval+in+Language+Agents
12. PROCED-MEM: Benchmarking Procedural Memory Retrieval in Language Agents Across Domains. Ishant Kohar, Aswanth Krishnan, 2026. https://scholar.google.com/scholar?q=PROCED-MEM:+Benchmarking+Procedural+Memory+Retrieval+in+Language+Agents+Across+Domains
13. G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems. Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan, 2025. https://scholar.google.com/scholar?q=G-Memory:+Tracing+Hierarchical+Memory+for+Multi-Agent+Systems
14. Scaling Agentic Verifier for Competitive Coding. Zeyao Ma et al., 2026. https://scholar.google.com/scholar?q=Scaling+Agentic+Verifier+for+Competitive+Coding
15. AgentPro: Enhancing LLM Agents with Automated Process Supervision. Yuchen Deng, Shichen Fan, Naibo Wang, Xinkui Zhao, See-Kiong Ng, 2025. https://scholar.google.com/scholar?q=AgentPro:+Enhancing+LLM+Agents+with+Automated+Process+Supervision
16. Recursive Introspection: Teaching Language Model Agents How to Self-Improve. Yuxiao Qu, Tianjun Zhang, Naman Garg, Aviral Kumar, 2024. https://scholar.google.com/scholar?q=Recursive+Introspection:+Teaching+Language+Model+Agents+How+to+Self-Improve
17. Agentic Refactoring: An Empirical Study of AI Coding Agents. Kosei Horikawa, Hao Li, Yutaro Kashiwa, Bram Adams, Hajimu Iida, Ahmed E. Hassan, 2025. https://scholar.google.com/scholar?q=Agentic+Refactoring:+An+Empirical+Study+of+AI+Coding+Agents
18. AI Post Transformers: TMAS: Scaling Test-Time Compute with Multi-Agent Synergy. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-14-tmas-scaling-test-time-compute-with-mult-3abe7a.mp3
19. AI Post Transformers: Benchmarking Test-Time Scaling for General LLM Agents. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-22-benchmarking-test-time-scaling-for-gener-8f14f9.mp3
20. AI Post Transformers: MiA-Signature and Global Activation for Long Context. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-13-mia-signature-and-global-activation-for-5ad62f.mp3
21. AI Post Transformers: Explicit Information Transmission for Context Compression. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3

Interactive Visualization: Trajectory Summaries for Long-Horizon Coding Agents [https://podcast.do-not-panic.com/viz/2026-05-24-trajectory-summaries-for-long-horizon-co-0194be.html]