Trajectory Summaries for Long-Horizon Coding Agents
This episode explores a paper on inference-time scaling for coding agents, asking whether extra test-time compute still helps when tasks are long, messy, and require multi-step tool use rather than a single code completion. It focuses on the paper's main argument that the real bottleneck is not generating more rollout attempts but representing prior attempts well enough to compare, select, and reuse them, with structured trajectory summaries serving as the middle layer between raw transcripts and final patches. The discussion examines two mechanisms: a parallel, tournament-style selection method over summaries, and a sequential refinement method that conditions later attempts on lessons distilled from earlier ones. The conversation connects agent performance gains to practical questions of context management and selection versus reuse, and asks whether the reported improvements reflect a deep scaling insight or simply better engineering around long-horizon coding workflows.
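The two mechanisms described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Summary` fields, the function names `tournament_select` and `sequential_refine`, and the `prefer` and `run_attempt` callbacks are all hypothetical. In particular, the description does not say how two summaries are compared or how lessons are distilled, so both are left as caller-supplied functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Summary:
    """Hypothetical structured summary of one rollout attempt."""
    attempt_id: int
    approach: str          # short description of what the attempt tried
    tests_passed: int      # assumed signal; the real summary may differ
    lessons: List[str]     # takeaways to carry into later attempts


def tournament_select(
    summaries: List[Summary],
    prefer: Callable[[Summary, Summary], Summary],
) -> Summary:
    """Parallel selection: compare summaries pairwise until one remains.

    `prefer` stands in for whatever judge picks the better of two
    summaries (assumed here; it could be a model call or a heuristic).
    """
    if not summaries:
        raise ValueError("need at least one summary")
    pool = list(summaries)
    while len(pool) > 1:
        next_round = []
        # Pair up neighbours; the winner of each pair advances.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(prefer(pool[i], pool[i + 1]))
        # With an odd pool, the last summary advances unopposed.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


def sequential_refine(
    run_attempt: Callable[[int, List[str]], Summary],
    rounds: int,
) -> List[Summary]:
    """Sequential refinement: each attempt sees lessons from earlier ones.

    `run_attempt(index, lessons)` is a hypothetical callback that runs one
    rollout conditioned on the accumulated lessons and returns its summary.
    """
    lessons: List[str] = []
    history: List[Summary] = []
    for index in range(rounds):
        summary = run_attempt(index, list(lessons))
        history.append(summary)
        # Accumulate new lessons without duplicates, preserving order.
        for lesson in summary.lessons:
            if lesson not in lessons:
                lessons.append(lesson)
    return history
```

In both functions only the compact summaries are passed around, never the raw transcripts, which is the context-management point the episode raises. A simple stand-in for `prefer` would pick the summary with more passing tests, though the approach discussed presumably relies on a richer comparison.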
Sources:
1. Scaling Test-Time Compute for Agentic Coding — Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, Anirudh Goyal, 2026
http://arxiv.org/abs/2604.16529
2. ReAct: Synergizing Reasoning and Acting in Language Models — Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao, 2023
https://scholar.google.com/scholar?q=ReAct:+Synergizing+Reasoning+and+Acting+in+Language+Models
3. Reflexion: Language Agents with Verbal Reinforcement Learning — Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao, 2023
https://scholar.google.com/scholar?q=Reflexion:+Language+Agents+with+Verbal+Reinforcement+Learning
4. ExpeL: LLM Agents Are Experiential Learners — Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang, 2023
https://scholar.google.com/scholar?q=ExpeL:+LLM+Agents+Are+Experiential+Learners
5. Rethinking Thinking Tokens: LLMs as Improvement Operators — Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, Anirudh Goyal, 2025
https://scholar.google.com/scholar?q=Rethinking+Thinking+Tokens:+LLMs+as+Improvement+Operators
6. CodeMonkeys: Scaling Test-Time Compute for Software Engineering — Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Re, Azalia Mirhoseini, 2025
https://scholar.google.com/scholar?q=CodeMonkeys:+Scaling+Test-Time+Compute+for+Software+Engineering
7. S*: Test Time Scaling for Code Generation — Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, 2025
https://scholar.google.com/scholar?q=S*:+Test+Time+Scaling+for+Code+Generation
8. Scaling Test-time Compute for LLM Agents — King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, Changwang Zhang, Chenghua Lin, Jun Wang, Ge Zhang, Wangchunshu Zhou, 2025
https://scholar.google.com/scholar?q=Scaling+Test-time+Compute+for+LLM+Agents
9. Agentic Test-Time Scaling for WebAgents — Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 2026
https://scholar.google.com/scholar?q=Agentic+Test-Time+Scaling+for+WebAgents
10. Does SWE-Bench-Verified Test Agent Ability or Model Memory? — Thanosan Prathifkumar, Noble Saji Mathews, Meiyappan Nagappan, 2025
https://scholar.google.com/scholar?q=Does+SWE-Bench-Verified+Test+Agent+Ability+or+Model+Memory?
11. A Benchmark for Procedural Memory Retrieval in Language Agents — Ishant Kohar, Aswanth Krishnan, 2025
https://scholar.google.com/scholar?q=A+Benchmark+for+Procedural+Memory+Retrieval+in+Language+Agents
12. PROCED-MEM: Benchmarking Procedural Memory Retrieval in Language Agents Across Domains — Ishant Kohar, Aswanth Krishnan, 2026
https://scholar.google.com/scholar?q=PROCED-MEM:+Benchmarking+Procedural+Memory+Retrieval+in+Language+Agents+Across+Domains
13. G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems — Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan, 2025
https://scholar.google.com/scholar?q=G-Memory:+Tracing+Hierarchical+Memory+for+Multi-Agent+Systems
14. Scaling Agentic Verifier for Competitive Coding — Zeyao Ma et al., 2026
https://scholar.google.com/scholar?q=Scaling+Agentic+Verifier+for+Competitive+Coding
15. AgentPro: Enhancing LLM Agents with Automated Process Supervision — Yuchen Deng, Shichen Fan, Naibo Wang, Xinkui Zhao, See-Kiong Ng, 2025
https://scholar.google.com/scholar?q=AgentPro:+Enhancing+LLM+Agents+with+Automated+Process+Supervision
16. Recursive Introspection: Teaching Language Model Agents How to Self-Improve — Yuxiao Qu, Tianjun Zhang, Naman Garg, Aviral Kumar, 2024
https://scholar.google.com/scholar?q=Recursive+Introspection:+Teaching+Language+Model+Agents+How+to+Self-Improve
17. Agentic Refactoring: An Empirical Study of AI Coding Agents — Kosei Horikawa, Hao Li, Yutaro Kashiwa, Bram Adams, Hajimu Iida, Ahmed E. Hassan, 2025
https://scholar.google.com/scholar?q=Agentic+Refactoring:+An+Empirical+Study+of+AI+Coding+Agents
18. AI Post Transformers: TMAS: Scaling Test-Time Compute with Multi-Agent Synergy — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-05-14-tmas-scaling-test-time-compute-with-mult-3abe7a.mp3
19. AI Post Transformers: Benchmarking Test-Time Scaling for General LLM Agents — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-22-benchmarking-test-time-scaling-for-gener-8f14f9.mp3
20. AI Post Transformers: MiA-Signature and Global Activation for Long Context — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-05-13-mia-signature-and-global-activation-for-5ad62f.mp3
21. AI Post Transformers: Explicit Information Transmission for Context Compression — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3
Interactive Visualization: Trajectory Summaries for Long-Horizon Coding Agents [https://podcast.do-not-panic.com/viz/2026-05-24-trajectory-summaries-for-long-horizon-co-0194be.html]