Ep#83: PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation

Description

Spatial understanding is essential for moving through complex environments, and it is a large part of the challenge of generalizing to new scenes. Most world models, however, largely ignore this spatial dimension and focus on 2D images. PointWorld does not. It is a 3D world model trained on real and simulated data that can perform a wide variety of manipulation tasks on a real robot, including grasping and handling articulated objects, all without any additional fine-tuning. Wenlong Huang joins us to explain what makes this work and how it differs from other world models.

Watch Episode #83 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

Abstract

Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild.

References

Project page: https://point-world.github.io/
arXiv: https://arxiv.org/abs/2601.03782

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

Ep#87: MolmoAct 2: An open foundation for robots that work in the real world

Few models are truly open, with both weights and data released. Yet such models are crucial for research and for developing new systems: they help us learn which data matters and help build new capabilities for deploying robots in the real world. MolmoAct2 provides a foundation for open research into robotics. It comes with its own open dataset, an open-data action tokenizer, and a reasoning variant that predicts depth tokens. People across the community have already been using it, running experiments in their own labs or homes. Haoquan Fang and Jiafei Duan tell us more.

Watch Episode 87 of RoboPapers, with Michael Cho and Chris Paxton, now!

Abstract

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today’s systems fall short for real-world deployment. Frontier models are closed; open-weight alternatives are tied to expensive hardware; reasoning-augmented policies pay prohibitive latency for their grounding; and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor, MolmoAct, along five axes. (1) MolmoAct2 is built on top of our new Molmo2-ER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. (2) We release three new robot datasets spanning low-to-medium cost platforms: MolmoAct2-BimanualYAM Dataset, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date; MolmoAct2-DROID Dataset, a quality-filtered Franka subset of DROID; and MolmoAct2-SO100/101 Dataset, a quality-filtered SO-100/101 subset. (3) We train and release MolmoAct2-FAST Tokenizer, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. (4) We design a new VLA architecture to graft the discrete-token VLM into the flow-matching continuous-action expert via per-layer key-value (KV) conditioning. (5) We propose MolmoAct2-Think, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including π0.5, while Molmo2-ER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data.

Learn More

Project page: https://allenai.org/blog/molmoact2
Code: https://github.com/allenai/molmoact2
arXiv: https://arxiv.org/pdf/2605.02881v1

And check out our episode on the original MolmoAct.

Jun 18, 2026 · 1 h 2 min
