Qwen-AgentWorld: Language World Models for General Agents

1 min · Yesterday

Description

QWEN-AGENTWORLD: LANGUAGE WORLD MODELS FOR GENERAL AGENTS
Episode 0044 | DTF:FTL | Daily Tech Feed: From The Labs

----------------------------------------

WHAT THIS PAPER DOES

Qwen-AgentWorld, from Alibaba's Qwen team, builds the missing half of the AI agent equation: a language world model, a system that predicts what happens next in an environment when an agent takes an action. Current AI agent research has focused almost entirely on the policy side: what action should the agent take? Qwen-AgentWorld addresses the complementary question: given the current state and an action, what is the next state? This is the world model. The paper argues, citing a 2025 theoretical proof (Richens et al.), that any agent capable of generalizing across a broad range of tasks must have learned a world model.

The result is two models that simulate seven categories of agent environments through long chain-of-thought reasoning: Qwen-AgentWorld-35B-A3B (open weights released; Mixture-of-Experts with 35B total parameters, 3B active) and Qwen-AgentWorld-397B-A17B (benchmark-evaluated; weights not publicly released).

----------------------------------------

THE SEVEN DOMAINS

The model simulates all of the following within a single unified framework:

* MCP (Model Context Protocol tool calls)
* Search (web search and extraction)
* Terminal (shell commands, bash)
* SWE (software engineering: read/edit/bash workflows)
* Android (touch/swipe/type on UI view hierarchies)
* Web (click/navigate via accessibility trees)
* OS (mouse/keyboard on desktop environments)

For the three GUI domains (Android, Web, OS), observations are represented as textual accessibility trees and UI view hierarchies rather than pixel frames, which makes them tractable for language model training.

----------------------------------------

HOW IT WAS TRAINED

Three-stage pipeline: "CPT injects, SFT activates, RL sharpens".

1. Continual Pre-Training (CPT): Trained on 10M+ real-world interaction trajectories collected from three sources: a dedicated agent infrastructure running automated tasks across all seven domains, open-source interaction traces (terminal recordings, agentic tool-call logs), and in-house Alibaba agentic trajectories. CPT injects environment dynamics without chain-of-thought reasoning.
2. Supervised Fine-Tuning (SFT): Activates next-state prediction as an explicit thinking pattern. The model learns to reason through what the environment will return before generating its prediction.
3. Reinforcement Learning (RL): Sharpens fidelity with a hybrid reward system combining rubric-based scoring (open-ended quality dimensions) and rule-based verifiers (deterministic checks).

Data pools across the three stages are strictly disjoint. The RL pool alone contains 92,308 trajectories averaging 13.4 turns each.

----------------------------------------

AGENTWORLDBENCH

A new evaluation benchmark built from real environment interactions of five frontier models on nine established agent benchmarks, including Terminal-Bench 1.0 and 2.0, OSWorld-Verified, and others. Evaluation uses rubric judging across five dimensions. All eval trajectories are out-of-distribution for the trained models.

AgentWorldBench results (overall score, higher is better):

Model                        Overall
Qwen-AgentWorld-397B-A17B    58.71
GPT-5.4                      58.25
Claude Opus 4.6              57.80
Claude Opus 4.8              56.59
Claude Sonnet 4.6            56.04
Qwen-AgentWorld-35B-A3B      56.39
Qwen3.5-35B-A3B (no LWM)     47.73

The 35B model with LWM training scores 8.66 points higher than the same base model without it (56.39 vs. 47.73).

----------------------------------------

TWO WAYS TO USE A WORLD MODEL

PARADIGM 1: DECOUPLED ENVIRONMENT SIMULATOR

Use the world model to simulate environments for agentic RL training, eliminating the need for real-environment access.
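To make the simulator idea concrete, here is a minimal sketch of a world model standing in for a real environment during a rollout. Everything in it is an illustrative assumption rather than the paper's code: the names SimulatedEnv, rollout, toy_world_model and toy_policy are hypothetical, and the toy functions are simple string stubs where a real setup would call the language world model and the agent policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical signatures: a world model maps (state, action) -> next state,
# and a policy maps state -> action. Both operate on plain text here.
WorldModel = Callable[[str, str], str]
Policy = Callable[[str], str]


@dataclass
class SimulatedEnv:
    """Wraps a world model so it can be stepped like an environment."""
    world_model: WorldModel
    state: str

    def step(self, action: str) -> str:
        # The predicted next state becomes the current state.
        self.state = self.world_model(self.state, action)
        return self.state


def rollout(env: SimulatedEnv, policy: Policy, max_turns: int) -> List[Tuple[str, str]]:
    """Collect (action, predicted observation) pairs without a real environment."""
    trajectory = []
    for _ in range(max_turns):
        action = policy(env.state)
        observation = env.step(action)
        trajectory.append((action, observation))
    return trajectory


# Toy stand-ins so the sketch runs; assumptions for illustration only.
def toy_world_model(state: str, action: str) -> str:
    return f"{state} | ran: {action}"


def toy_policy(state: str) -> str:
    return "ls" if "ran: ls" not in state else "pwd"


env = SimulatedEnv(world_model=toy_world_model, state="$")
trajectory = rollout(env, toy_policy, max_turns=2)
```

The point of the wrapper is that the training loop only ever calls step, so swapping a real terminal or browser for a predicted one would not change the code that collects trajectories.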
Key results:

* Generalizable simulation: Sim RL on 4,000 out-of-distribution OpenClaw environments yielded +4.3 on Claw-Eval and +7.1 on QwenClawBench vs. real-environment RL with a weaker simulator.
* Controllable perturbations (MCP): Injecting targeted adversarial conditions (e.g., hidden answers, degraded tool responses) during training: +3.7 on Tool Decathlon, +12.3 on MCPMark.
* Fictional-world construction (Search): Agents trained entirely in invented, self-consistent fictional search worlds: +16.29 on WideSearch F1 Item, +10.49 on WideSearch F1 Row, surpassing real-environment training.

The fictional-world result is particularly striking. It suggests that self-consistency of the simulated world, not factual accuracy, is what matters for generalization.

PARADIGM 2: UNIFIED AGENT FOUNDATION MODEL

Use LWM training as a warm-up or auxiliary training stage before downstream agentic RL. The world model acquaints the agent with environment dynamics before it has to act.

Agent performance gains (35B model, LWM RL warm-up vs. SFT baseline):

Benchmark             Baseline   w/ LWM RL   Gain
Terminal-Bench 2.0    33.25      39.55       +6.30
SWE-Bench Verified    64.47      67.86       +3.39
SWE-Bench Pro         42.18      47.42       +5.24
WideSearch F1 Item    33.38      46.17       +12.79
Claw-Eval             53.60      64.88       +11.28
QwenClawBench         39.76      49.43       +9.67
BFCL v4               62.29      71.25       +8.96

Gains appear across in-domain and out-of-domain benchmarks. Three of the seven benchmarks are entirely outside the LWM training distribution.

----------------------------------------

WHY THIS MATTERS

The open-weights angle: Qwen is an Alibaba project. The 35B-A3B model weights and AgentWorldBench dataset are publicly released on HuggingFace. A Chinese industrial lab releasing competitive open-weight models continues to compress the gap between proprietary frontier systems and what any researcher or developer can run.

The simulation unlock: If you can simulate environments accurately enough to train real agents, you can scale RL training without scaling real-world compute infrastructure. Every shell command, every API call, every GUI tap becomes synthetically reproducible. The fictional-world result suggests the bar for "accurate enough" may be lower than expected: internal consistency matters more than ground truth.

The missing piece argument: The theoretical backing (Richens et al. 2025: generalization requires world models) reframes this as a necessary research direction, not a nice-to-have. If that proof holds, world models are not optional.

The open questions: Does this transfer to pixel-based environments? How does simulation fidelity degrade for rare or adversarial states? The 397B model is not publicly released, so the benchmark-beating number comes from the closed model.

----------------------------------------

LINKS

* Paper: https://arxiv.org/abs/2606.24597
* GitHub: https://github.com/QwenLM/Qwen-AgentWorld
* Model weights (35B): https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B
* AgentWorldBench dataset: https://huggingface.co/datasets/Qwen/AgentWorldBench
* Qwen blog post: https://qwen.ai/blog?id=qwen-agentworld
* Richens et al. 2025 (world models are necessary): Referenced in paper section 1
* Terminal-Bench: Referenced benchmark (Merrill et al. 2026)
* OSWorld-Verified: https://arxiv.org/abs/2404.07972

----------------------------------------

AI disclosure: This episode script was written with AI assistance.

All episodes

42 episodes

Qwen-AgentWorld: Language World Models for General Agents

Yesterday · 1 min

The Singularity Is Not Near Without Symbolic Model Synthesis

Show notes for this episode cover about 27 links: the primary paper (arXiv 2601.05280); the institution, King's College London; Wikipedia background on Richard Sutton, Ray Kurzweil, The Singularity Is Near, Kolmogorov complexity, algorithmic probability, entropy, supermartingales, random walks, the data processing inequality, KL divergence, and cross-entropy; arXiv papers on the Block Decomposition Method (1609.00110), DeepSeek R1 (2501.12948), and Kimi k1.5 (2501.12599), plus an OpenAI o-series entry (listed as 2409.12186); DeepMind material on AlphaGo, AlphaZero, and AlphaProof; the Bitter Lesson essay; Hector Zenil's Google Scholar profile; and podcast distribution links.

May 1, 2026 · 13 min

The Board Has Been Terminated

EPISODE 042: THE BOARD HAS BEEN TERMINATED

Why it matters. On April 24, 2026, the White House fired, by email, all twenty-four members of the National Science Board [https://www.nsf.gov/nsb/], the independent governing body of the National Science Foundation [https://www.nsf.gov/], the agency that funded the public internet, the graphical web browser, 3D printing, the Antarctic climate record, and the foundational research pipeline behind modern AI. The firings came twelve days after the NSB objected to the Office of Management and Budget bypassing their statutory approval authority on a $900 million Antarctic research vessel contract. This episode traces the money, the mechanism, and the history of what the NSF built, and what independent scientific oversight was designed to protect.

The National Science Foundation. The NSF was created by the National Science Foundation Act of 1950 [https://www.nsf.gov/about/history/nsf50/legislation.jsp], inspired by Vannevar Bush [https://en.wikipedia.org/wiki/Vannevar_Bush]'s 1945 report, commissioned by President Roosevelt, Science: The Endless Frontier [https://www.nsf.gov/about/history/EndlessFrontier_w.pdf]. Bush, who directed the Office of Scientific Research and Development [https://en.wikipedia.org/wiki/Office_of_Scientific_Research_and_Development] during World War II, argued that government-funded, scientist-directed basic research, independent of political control, was essential to national prosperity and security. The NSF grew from a $225,000 initial budget to approximately $9.9 billion.
Key NSF-funded breakthroughs include: NSFNET [https://en.wikipedia.org/wiki/NSFNET], the backbone that became the public internet; the Mosaic browser [https://en.wikipedia.org/wiki/Mosaic_(web_browser)] built at the National Center for Supercomputing Applications [https://en.wikipedia.org/wiki/National_Center_for_Supercomputing_Applications] by Marc Andreessen [https://en.wikipedia.org/wiki/Marc_Andreessen]; foundational 3D printing patents [https://en.wikipedia.org/wiki/Selective_laser_sintering] including selective laser sintering; buckminsterfullerene [https://en.wikipedia.org/wiki/Buckminsterfullerene] research (Nobel Prize 1996); the discovery of hydrothermal vents [https://en.wikipedia.org/wiki/Hydrothermal_vent] by the submersible Alvin [https://en.wikipedia.org/wiki/DSV_Alvin] in 1977; Antarctic ice core [https://en.wikipedia.org/wiki/Ice_core] climate records; and decades of academic computer science [https://www.nsf.gov/dir/index.jsp?org=CISE] and mathematics [https://www.nsf.gov/div/index.jsp?div=DMS] research underlying the transformer architecture [https://arxiv.org/abs/1706.03762]. The National Science Board. The NSB [https://www.nsf.gov/nsb/] consists of twenty-four presidentially appointed members — prominent scientists and industry leaders serving staggered six-year terms, confirmed by the Senate. Their statutory authority under the NSF Act [https://www.law.cornell.edu/uscode/text/42/chapter-16] includes approving major infrastructure expenditures and setting policy direction. Among the dismissed members was Keivan Stassun [https://scholar.google.com/citations?user=2kJyiEIAAAAJ], an astrophysicist at Vanderbilt University [https://www.vanderbilt.edu/] who reported that NSF's acting director told the board, "We don't listen to you anymore." Rep. Zoe Lofgren [https://en.wikipedia.org/wiki/Zoe_Lofgren], ranking Democrat on the House Science Committee [https://science.house.gov/], warned the vacancies would be filled with political loyalists. 
The Antarctic Research Vessel. The FY2027 budget request includes $900 million for an Antarctic research vessel to replace the RV Nathaniel B. Palmer [https://en.wikipedia.org/wiki/RV_Nathaniel_B._Palmer], whose lease NSF ended in 2025. The contract went to Gibbs & Cox [https://www.gibbscox.com/], a subsidiary of Leidos [https://www.leidos.com/] — one of the largest U.S. defense contractors (~$15.4 billion annual revenue), whose primary clients include the Department of Defense [https://www.defense.gov/], the NSA [https://www.nsa.gov/], the NRO [https://www.nro.gov/], and DHS [https://www.dhs.gov/]. The vessel was designed without helicopter capability and without a moonpool [https://en.wikipedia.org/wiki/Moonpool] for undersea instrument deployment — features standard on comparable vessels like Germany's Polarstern [https://en.wikipedia.org/wiki/RV_Polarstern], South Korea's Araon [https://en.wikipedia.org/wiki/IBRV_Araon], and the UK's RRS Sir David Attenborough [https://en.wikipedia.org/wiki/RRS_Sir_David_Attenborough]. The same budget proposes a 71% cut to NSF's polar research grants. The Antarctic Treaty [https://en.wikipedia.org/wiki/Antarctic_Treaty_System] prohibits military activity on the continent — an NSF research vessel avoids treaty review that a Navy icebreaker would require. The Money. According to FEC filings [https://www.fec.gov/] and Senate lobbying disclosures [https://lda.senate.gov/], Leidos spent $2.077 million on political contributions and $3.82 million on lobbying in the 2024 cycle, placing them in the top 2% of all lobbying clients. Twenty-two of thirty-three Leidos lobbyists (67%) are former federal employees. Contributions targeted the House Appropriations Committee [https://appropriations.house.gov/] ($247K), Armed Services [https://armedservices.house.gov/], Intelligence [https://intelligence.house.gov/], and Homeland Security [https://homeland.house.gov/] committees. 
The House Science Committee [https://science.house.gov/], which oversees NSF, received $46,970, of which 95% ($44,556) went to Republicans.

Historical Context. The episode draws a structural parallel to the Deutsche Physik [https://en.wikipedia.org/wiki/Deutsche_Physik] movement in 1930s Germany, where replacement of independent scientific leadership with ideologically aligned figures drove out researchers including Albert Einstein [https://en.wikipedia.org/wiki/Albert_Einstein], Max Born [https://en.wikipedia.org/wiki/Max_Born], and Erwin Schrödinger [https://en.wikipedia.org/wiki/Erwin_Schr%C3%B6dinger], and hampered the German nuclear program. It also references Lysenkoism [https://en.wikipedia.org/wiki/Lysenkoism] in the Soviet Union, where politically mandated biology set agriculture back a generation. Vannevar Bush designed NSF's independence structure, with staggered terms and statutory (not advisory) authority, as an explicit safeguard against these mechanisms.

Daily Tech Feed: From the Labs is available on Apple Podcasts [https://podcasts.apple.com/podcast/id1876696209], Spotify [https://open.spotify.com/show/7wb7q9pM4yxIPidH1JQoss], and wherever fine podcasts are distributed. Visit us at pod.c457.org [https://pod.c457.org] for all our shows. New episodes daily.

Apr 27, 2026 · 18 min

Rough Consensus and Running Scared

EPISODE 0040: ROUGH CONSENSUS AND RUNNING SCARED

Why it matters. Between October 2025 and April 2026, cryptographer Daniel Bernstein published a seven-part blog series titled "NSA and IETF" [https://blog.cr.yp.to/2025.html] alleging that intelligence agencies are using the IETF [https://www.ietf.org] standards process to weaken the next generation of internet encryption. The dispute centers on whether the successor to current TLS key exchange should use hybrid post-quantum cryptography, which combines classical elliptic curves with the new lattice-based ML-KEM [https://csrc.nist.gov/pubs/fips/203/final], or ML-KEM alone. The technical stakes are existential: if ML-KEM is eventually broken and the deployed standard is non-hybrid, every session protected by it becomes retroactively decryptable from stored ciphertext. The cost of the safety net is thirty-two bytes. The cost of removing it could be everything.

The IETF and the TLS Working Group. The Internet Engineering Task Force [https://www.ietf.org] writes the technical specifications underlying the internet, including TLS (Transport Layer Security) [https://datatracker.ietf.org/doc/html/rfc8446], the protocol behind every padlock icon in your browser. The contested draft proposes non-hybrid ML-KEM key exchange for TLS. The blog series is published at blog.cr.yp.to [https://blog.cr.yp.to]. IETF mailing list archives are publicly accessible via the IETF Datatracker [https://datatracker.ietf.org]. The IETF's own consensus process is defined in RFC 7282 [https://datatracker.ietf.org/doc/html/rfc7282]. Moderation procedures are governed by RFC 3934 [https://datatracker.ietf.org/doc/html/rfc3934]. NIST FIPS 203 (ML-KEM) [https://csrc.nist.gov/pubs/fips/203/final] is the post-quantum key encapsulation standard formerly known as Kyber.
The NSA's CNSA 2.0 [https://media.defense.gov/2022/Sep/07/2003071834/-1/-1/0/CSA_CNSA_2.0_ALGORITHMS_.PDF] suite mandates post-quantum algorithms for national security systems. NIST SP 800-227 [https://csrc.nist.gov/pubs/sp/800/227/ipd] explicitly permits hybrid combinations. The Researcher. Daniel J. Bernstein [https://scholar.google.com/citations?user=PFcoNOEAAAAJ] is a professor at the University of Illinois at Chicago and Eindhoven University of Technology. He is the designer of Curve25519 [https://cr.yp.to/ecdh.html], Ed25519 [https://ed25519.cr.yp.to], ChaCha20 [https://cr.yp.to/chacha.html], and Poly1305 [https://cr.yp.to/mac.html] — algorithms now deployed in Signal [https://signal.org], WhatsApp [https://www.whatsapp.com], WireGuard [https://www.wireguard.com], Tor [https://www.torproject.org], SSH, and TLS. He also built qmail [https://cr.yp.to/qmail.html]. In 1995, he filed Bernstein v. United States [https://en.wikipedia.org/wiki/Bernstein_v._United_States], the landmark case in which the Ninth Circuit [https://en.wikipedia.org/wiki/United_States_Court_of_Appeals_for_the_Ninth_Circuit] ruled that source code is protected speech under the First Amendment, effectively ending US export restrictions on cryptographic software. Key Technical Concepts. The core issue is the post-quantum migration of TLS 1.3 [https://datatracker.ietf.org/doc/html/rfc8446] key exchange. Shor's algorithm [https://en.wikipedia.org/wiki/Shor%27s_algorithm] on a sufficiently powerful quantum computer can break the elliptic curve Diffie-Hellman [https://en.wikipedia.org/wiki/Elliptic-curve_Diffie%E2%80%93Hellman] key exchange (X25519 [https://datatracker.ietf.org/doc/html/rfc7748]) currently used in TLS. ML-KEM (FIPS 203) [https://csrc.nist.gov/pubs/fips/203/final], a lattice-based [https://en.wikipedia.org/wiki/Lattice-based_cryptography] key encapsulation mechanism, is NIST's standardized replacement. 
Hybrid mode combines X25519 and ML-KEM so that the session stays secure as long as either component remains unbroken: if ML-KEM falls to classical cryptanalysis (as SIKE [https://en.wikipedia.org/wiki/Supersingular_isogeny_key_exchange] did in 2022, broken by Castryck and Decru [https://eprint.iacr.org/2022/975]), the classical layer holds. The harvest-now-decrypt-later [https://en.wikipedia.org/wiki/Harvest_now,_decrypt_later] threat means nation-states are recording encrypted traffic today for future quantum decryption. The precedent of Dual EC DRBG [https://en.wikipedia.org/wiki/Dual_EC_DRBG], a NIST-standardized random number generator confirmed [https://en.wikipedia.org/wiki/Dual_EC_DRBG#NSA_involvement] to have been deliberately backdoored by the NSA, is central to Bernstein's argument about institutional trust. Vulnerabilities in ML-KEM implementations (KyberSlash 1 [https://kyberslash.cr.yp.to] and 2, Clangover) and the broader erosion of lattice security margins documented in Bernstein's analysis underscore the case for defense in depth. Of the sixty-nine original NIST post-quantum submissions, approximately half have been broken by classical attacks.

Daily Tech Feed: From the Labs is available on Apple Podcasts [https://podcasts.apple.com/podcast/id1876696209], Spotify [https://open.spotify.com/show/7wb7q9pM4yxIPidH1JQoss], and wherever fine podcasts are distributed. Visit us at pod.c457.org [https://pod.c457.org] for all our shows. New episodes daily.

Apr 9, 2026 · 24 min

Symbols Strike Back

EPISODE 0039: SYMBOLS STRIKE BACK

Why it matters. A controlled experiment pits a neuro-symbolic system against a vision-language-action foundation model on the same robotic manipulation task, same robot, same simulation, same evaluation protocol, and the results are devastating for the foundation model. The paper "The Price Is Not Right" [https://arxiv.org/abs/2602.19260], accepted at ICRA 2026 [https://2026.ieee-icra.org] in Vienna, shows that a symbolic planning system trained on one-sixth the data in thirty-four minutes achieves 95% success on robotic Towers of Hanoi where the fine-tuned pi-zero [https://www.physicalintelligence.company/blog/pi0] VLA achieves 34%. On an unseen four-block generalization task, the gap is 78% versus zero. The training energy ratio is eighty to one. The inference power ratio is six to one. For structured manipulation tasks, the "just scale it" orthodoxy fails on performance, efficiency, and generalization simultaneously.

Tufts University. The paper comes from the Human-Robot Interaction Lab [https://hrilab.tufts.edu] at Tufts University [https://www.tufts.edu]. The full paper is available on arXiv (2602.19260) [https://arxiv.org/abs/2602.19260]. Code, evaluation frameworks, fine-tuning scripts, and energy measurement methodology are published at price-is-not-right.github.io [https://price-is-not-right.github.io]. The neuro-symbolic system uses the Robosuite [https://robosuite.ai] simulation environment with a Franka Panda arm. The VLA baseline uses OpenPi [https://github.com/Physical-Intelligence/openpi], Physical Intelligence's open-source training framework for pi-zero [https://arxiv.org/abs/2410.24164].

The Researchers. Timothy Duggan, Pierrick Lorang, and Hong Lu are researchers in the Tufts HRI Lab.
Matthias Scheutz [https://scholar.google.com/citations?user=9FeKVNEAAAAJ] is the lab director and has worked on cognitive architectures and symbolic reasoning for robots for over two decades: through the deep learning winter for symbolic methods, through the years when planning research went unfunded, through the period when PDDL [https://en.wikipedia.org/wiki/Planning_Domain_Definition_Language] was treated as a historical curiosity. The paper also engages with the work of Subbarao Kambhampati [https://scholar.google.com/citations?user=FMnS0jQAAAAJ], a prominent AI planning researcher who has published extensively on the inability of large language models to perform reliable planning.

Key Technical Concepts. The neuro-symbolic system is a four-layer architecture combining mature components: YOLOv8 [https://github.com/ultralytics/ultralytics] for object detection, a gradient boosting regressor for 3D pose estimation, answer set programming [https://en.wikipedia.org/wiki/Answer_set_programming] (ASP) for automatically inferring a PDDL [https://en.wikipedia.org/wiki/Planning_Domain_Definition_Language] domain from 50 demonstrations, the MetricFF [https://fai.cs.uni-saarland.de/hoffmann/metric-ff.html] classical planner for optimal plan generation, and small diffusion policies [https://arxiv.org/abs/2303.04137] for motor execution. The key insight is decomposition: symbolic planning handles sequencing (what to do), neural policies handle execution (how to do it). The VLA baseline is pi-zero [https://arxiv.org/abs/2410.24164], pairing a PaliGemma [https://arxiv.org/abs/2407.07726] 2B-parameter vision-language backbone with a 300M-parameter flow-matching action head, fine-tuned via LoRA [https://arxiv.org/abs/2106.09685]. The paper tests two VLA configurations: end-to-end (receives only "Play Towers of Hanoi") and planner-guided (receives optimal sub-task commands from an oracle).
The planner-guided VLA, which gets the complete answer sheet, scores zero on the full three-block task due to compounding positional error with no error correction mechanism. The paper situates its findings against Rich Sutton [https://en.wikipedia.org/wiki/Richard_S._Sutton]'s influential 2019 essay "The Bitter Lesson" [http://www.incompleteideas.net/IncIdeas/BitterLesson.html] and Yann LeCun [https://en.wikipedia.org/wiki/Yann_LeCun]'s arguments for structured world models via his JEPA architecture [https://arxiv.org/abs/2301.08243]. The paper also tested LLM-based planners: GPT-5 [https://openai.com/index/gpt-5/] produced optimal Towers of Hanoi plans 84% of the time but with 63-second latency per query, while smaller models (Qwen 7B, PaliGemma 3B) produced invalid plans 100% of the time. MetricFF, by contrast, solved the task optimally in under a second on CPU.

Daily Tech Feed: From the Labs is available on Apple Podcasts [https://podcasts.apple.com/podcast/id1876696209], Spotify [https://open.spotify.com/show/7wb7q9pM4yxIPidH1JQoss], and wherever fine podcasts are distributed. Visit us at pod.c457.org [https://pod.c457.org] for all our shows. New episodes daily.

Apr 6, 2026 · 30 min
