Ep#78: Three Eras of Robot Learning

Description

Robotics has changed dramatically over the last eight years. Ted has been at the cutting edge of robot learning throughout this period, spending those eight years at Google Brain/Google DeepMind, and he has identified three eras of robot learning:

* The Era of Existence Proofs - trying different methods like QT-Opt, on-robot RL
* The Era of Foundation Models - transitioning to data collection and clean objectives (i.e. supervised learning)
* The Era of Scaling - orders of magnitude more data and larger models, enabling reasoning, long-horizon actions, and cross-embodiment transfer

Something succeeds only when everything goes right. Behavior cloning, for example, seemed stuck at a 60-70% success rate on key tasks until his team rewrote their learning stack, at which point it hit 95-99%+ success rates.

For most of those eight years, something was wrong. The stack wasn't quite right, the learning algorithms were wrong, the data didn't exist. Hardware and operations were not mature enough. But they kept working on these problems, over and over, until they finally arrived at an amazing breakthrough.

Some key trends now:

* Reasoning models for robotics
* Long-horizon, precision-oriented tasks, like making coffee from Physical Intelligence or GPU assembly from Skild
* Cross-embodiment transfer
* Hardware and model co-design
* Results are nice, but capabilities matter even more, and academics are going to have trouble keeping up with the compute and resources available to companies

Watch Episode 78 of RoboPapers, with Michael Cho and Jiafei Duan, to learn more!

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com [https://robopapers.substack.com?utm_medium=podcast&utm_campaign=CTA_1]

Ep#83: PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation

Spatial understanding is important for moving around in complex environments and is a huge part of the challenge of generalizing to new scenes. Most world models, however, largely ignore this spatial dimension, focusing on 2D images.

Not PointWorld, though. PointWorld is a 3D world model trained on real and simulated data which can perform a wide variety of manipulation tasks on a real robot, including grasping or handling articulated objects, all without any additional fine-tuning.

Wenlong Huang joins us to tell us more about what makes this work and how it's different from other world models.

Watch Episode #83 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

Abstract

Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild.

References

Project page: https://point-world.github.io/
arXiv: https://arxiv.org/abs/2601.03782
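The abstract describes using a point-flow world model inside model-predictive control. As a rough illustration of that idea only, here is a minimal random-shooting planner in pure Python. Everything in it is an assumption made for the sketch: `toy_point_flow_model` is a hypothetical stand-in that rigidly translates every point by the summed actions, and none of the function names, parameters, or the cost correspond to PointWorld's actual code or API.

```python
import random


def toy_point_flow_model(points, actions):
    """Hypothetical stand-in for a learned 3D world model (NOT PointWorld).

    Returns one 3D displacement per point for the whole action sequence.
    Here every point simply moves by the sum of the actions.
    """
    dx = sum(a[0] for a in actions)
    dy = sum(a[1] for a in actions)
    dz = sum(a[2] for a in actions)
    return [(dx, dy, dz) for _ in points]


def apply_flow(points, flow):
    """Add the predicted displacement to each 3D point."""
    return [(p[0] + f[0], p[1] + f[1], p[2] + f[2]) for p, f in zip(points, flow)]


def cost(points, goal_points):
    """Mean squared distance between predicted points and goal points."""
    total = 0.0
    for p, g in zip(points, goal_points):
        total += sum((pi - gi) ** 2 for pi, gi in zip(p, g))
    return total / len(points)


def plan_random_shooting(points, goal_points, model,
                         horizon=3, num_samples=256, max_step=0.05, seed=0):
    """Sample random action sequences and keep the one with the lowest cost.

    Each action is an assumed 3D end-effector translation, with every
    component drawn uniformly from [-max_step, max_step].
    """
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_samples):
        actions = [
            tuple(rng.uniform(-max_step, max_step) for _ in range(3))
            for _ in range(horizon)
        ]
        predicted = apply_flow(points, model(points, actions))
        c = cost(predicted, goal_points)
        if c < best_cost:
            best_actions, best_cost = actions, c
    return best_actions, best_cost


# Usage: push two points 5 cm along x.
scene = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
goal = [(0.05, 0.0, 0.0), (0.15, 0.0, 0.0)]
plan, plan_cost = plan_random_shooting(scene, goal, toy_point_flow_model)
```

In a real MPC loop, only the first action of `plan` would typically be executed before re-observing the scene and planning again; a learned model would replace the toy predictor, and the sampler could be swapped for a stronger optimizer.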

May 29, 2026, 1 h 22 min
