Daily Tech Feed: From the Labs
QWEN-AGENTWORLD: LANGUAGE WORLD MODELS FOR GENERAL AGENTS
Episode 0044 | DTF:FTL | Daily Tech Feed: From The Labs

----------------------------------------
WHAT THIS PAPER DOES

Qwen-AgentWorld, from Alibaba's Qwen team, builds the missing half of the AI agent equation: a language world model (LWM), a system that predicts what happens next in an environment when an agent takes an action.

Current AI agent research has focused almost entirely on the policy side: what action should the agent take? Qwen-AgentWorld addresses the complementary question: given the current state and an action, what is the next state? This is the world model. The paper argues, backed by a 2025 theoretical proof (Richens et al.), that any agent capable of generalizing across a broad range of tasks must have learned a world model.

The result is two models: Qwen-AgentWorld-35B-A3B (open weights released; 35B parameters, 3B active, Mixture-of-Experts) and Qwen-AgentWorld-397B-A17B (benchmark-evaluated, not publicly released). Both simulate seven categories of agent environments through long chain-of-thought reasoning.

----------------------------------------
THE SEVEN DOMAINS

The model simulates all of the following within a single unified framework:

* MCP (Model Context Protocol tool calls)
* Search (web search and extraction)
* Terminal (shell commands, bash)
* SWE (software engineering: read/edit/bash workflows)
* Android (touch/swipe/type on UI view hierarchies)
* Web (click/navigate via accessibility trees)
* OS (mouse/keyboard on desktop environments)

For the three GUI domains (Android, Web, OS), observations are represented as textual accessibility trees and UI view hierarchies rather than pixel frames, which makes them tractable for language model training.

----------------------------------------
HOW IT WAS TRAINED

Three-stage pipeline ("CPT injects, SFT activates, RL sharpens"):

1. Continual Pre-Training (CPT): Trained on 10M+ real-world interaction trajectories collected from three sources: a dedicated agent infrastructure running automated tasks across all seven domains, open-source interaction traces (terminal recordings, agentic tool-call logs), and in-house Alibaba agentic trajectories. CPT injects environment dynamics without chain-of-thought reasoning.
2. Supervised Fine-Tuning (SFT): Activates next-state prediction as an explicit thinking pattern: the model learns to reason through what the environment will return before generating its prediction.
3. Reinforcement Learning (RL): Sharpens fidelity with a hybrid reward system combining rubric-based scoring (open-ended quality dimensions) and rule-based verifiers (deterministic checks).

Data pools across the three stages are strictly disjoint. The RL pool alone contains 92,308 trajectories averaging 13.4 turns each.

----------------------------------------
AGENTWORLDBENCH

A new evaluation benchmark built from real environment interactions of five frontier models on nine established agent benchmarks, including Terminal-Bench 1.0 and 2.0, OSWorld-Verified, and others. Evaluation uses rubric judging across five dimensions. All eval trajectories are out-of-distribution for the trained models.

AgentWorldBench results (overall score, higher is better):

Model                        Overall
Qwen-AgentWorld-397B-A17B    58.71
GPT-5.4                      58.25
Claude Opus 4.6              57.80
Claude Opus 4.8              56.59
Claude Sonnet 4.6            56.04
Qwen-AgentWorld-35B-A3B      56.39
Qwen3.5-35B-A3B (no LWM)     47.73

The 35B model with LWM training scores 8.66 points higher than the same model without it (56.39 vs. 47.73).

----------------------------------------
TWO WAYS TO USE A WORLD MODEL

PARADIGM 1: DECOUPLED ENVIRONMENT SIMULATOR

Use the world model to simulate environments for agentic RL training, eliminating the need for real-environment access.
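To make the decoupled-simulator idea concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes a world model can be treated as a function from (state, action) to a predicted next state. The names `toy_world_model`, `scripted_policy`, and `rollout` are hypothetical, and the small rule table stands in for what would be a call to a language model in a real setup.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical types for illustration only.
State = str   # textual observation, e.g. terminal output
Action = str  # textual action, e.g. a shell command

WorldModel = Callable[[State, Action], State]
Policy = Callable[[State], Action]


def toy_world_model(state: State, action: Action) -> State:
    """Stand-in for a language world model: predict the next observation.

    A real setup would prompt an LLM with (state, action); this rule
    table only keeps the sketch runnable without any model.
    """
    if not action.strip():
        return state  # no action taken, so the observation is unchanged
    if action == "pwd":
        return "/home/agent"
    if action.startswith("echo "):
        return action[len("echo "):]
    return f"bash: {action.split()[0]}: command not found"


def scripted_policy(commands: Iterable[Action]) -> Policy:
    """Toy policy that replays a fixed command list, ignoring the state."""
    remaining = iter(commands)

    def policy(state: State) -> Action:
        return next(remaining, "")  # empty action once the script runs out

    return policy


def rollout(
    world_model: WorldModel,
    policy: Policy,
    initial_state: State,
    max_turns: int,
) -> List[Tuple[State, Action, State]]:
    """Collect (state, action, next_state) turns without a real environment."""
    trajectory = []
    state = initial_state
    for _ in range(max_turns):
        action = policy(state)
        next_state = world_model(state, action)  # simulated, never executed
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


if __name__ == "__main__":
    policy = scripted_policy(["pwd", "echo hi", "lsx -l"])
    for turn in rollout(toy_world_model, policy, "", max_turns=3):
        print(turn)
```

No shell is ever touched: every observation comes from the simulator, so the collected turns could in principle be handed to an RL trainer in place of real-environment rollouts.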
Key results:

* Generalizable simulation: Sim RL on 4,000 out-of-distribution OpenClaw environments yielded +4.3 on Claw-Eval and +7.1 on QwenClawBench vs. real-environment RL with a weaker simulator.
* Controllable perturbations (MCP): Injecting targeted adversarial conditions (e.g., hidden answers, degraded tool responses) during training: +3.7 on Tool Decathlon, +12.3 on MCPMark.
* Fictional-world construction (Search): Agents trained entirely in invented, self-consistent fictional search worlds: +16.29 on WideSearch F1 Item, +10.49 on WideSearch F1 Row, surpassing real-environment training.

The fictional-world result is particularly striking. Self-consistency of the simulated world, not factual accuracy, is what matters for generalization.

PARADIGM 2: UNIFIED AGENT FOUNDATION MODEL

Use LWM training as a warm-up or auxiliary training stage before downstream agentic RL. The world model acquaints the agent with environment dynamics before it has to act.

Agent performance gains (35B model, LWM RL warm-up vs. SFT baseline):

Benchmark             Baseline   w/ LWM RL   Gain
Terminal-Bench 2.0    33.25      39.55       +6.30
SWE-Bench Verified    64.47      67.86       +3.39
SWE-Bench Pro         42.18      47.42       +5.24
WideSearch F1 Item    33.38      46.17       +12.79
Claw-Eval             53.60      64.88       +11.28
QwenClawBench         39.76      49.43       +9.67
BFCL v4               62.29      71.25       +8.96

Gains appear across in-domain and out-of-domain benchmarks. Three of the seven benchmarks are entirely outside the LWM training distribution.

----------------------------------------
WHY THIS MATTERS

The open-weights angle: Qwen is an Alibaba project. The 35B-A3B model weights and AgentWorldBench dataset are publicly released on HuggingFace. A Chinese industrial lab releasing competitive open-weight models continues to compress the gap between proprietary frontier systems and what any researcher or developer can run.

The simulation unlock: If you can simulate environments accurately enough to train real agents, you can scale RL training without scaling real-world compute infrastructure. Every shell command, every API call, every GUI tap becomes synthetically reproducible. The fictional-world result suggests the bar for "accurate enough" may be lower than expected: internal consistency matters more than ground truth.

The missing piece argument: The theoretical backing (Richens et al. 2025: generalization requires world models) reframes this as a necessary research direction, not a nice-to-have. If that proof holds, world models are not optional.

The open questions: Does this transfer to pixel-based environments? How does simulation fidelity degrade for rare or adversarial states? The 397B model is not publicly released; the benchmark-beating number comes from the closed model.

----------------------------------------
LINKS

* Paper: https://arxiv.org/abs/2606.24597
* GitHub: https://github.com/QwenLM/Qwen-AgentWorld
* Model weights (35B): https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B
* AgentWorldBench dataset: https://huggingface.co/datasets/Qwen/AgentWorldBench
* Qwen blog post: https://qwen.ai/blog?id=qwen-agentworld
* Richens et al. 2025 (world models are necessary): Referenced in paper section 1
* Terminal-Bench: Referenced benchmark (Merrill et al. 2026)
* OSWorld-Verified: https://arxiv.org/abs/2404.07972

----------------------------------------
AI disclosure: This episode script was written with AI assistance.