RoboPapers

Ep#78: Three Eras of Robot Learning

1 h 10 min · May 5, 2026
Description

Robotics has changed dramatically over the last eight years. Ted has been at the cutting edge of robot learning throughout this period, spending those eight years at Google Brain/Google DeepMind, and he has identified three eras of robot learning:

* The Era of Existence Proofs - trying different methods like QT-Opt and on-robot RL
* The Era of Foundation Models - transitioning to data collection and clean objectives (i.e. supervised learning)
* The Era of Scaling - orders of magnitude more data and larger models, enabling reasoning, long-horizon actions, and cross-embodiment transfer

Something succeeds only when everything goes right. Behavior cloning, for example, seemed stuck at a 60-70% success rate on key tasks until his team rewrote their learning stack, at which point it hit 95-99%+ success rates. For most of those eight years, something was wrong. The stack wasn’t quite right, the learning algorithms were wrong, the data didn’t exist. Hardware and operations were not mature enough. But they kept working on these problems, over and over, until they finally arrived at an amazing breakthrough.

Some key trends now:

* Reasoning models for robotics
* Long-horizon, precision-oriented tasks, like making coffee from Physical Intelligence or GPU assembly from Skild
* Cross-embodiment transfer
* Hardware and model co-design
* Results are nice, but capabilities matter even more, and academics are going to have trouble keeping up with the compute and resources available to companies

Watch Episode 78 of RoboPapers, with Michael Cho and Jiafei Duan, to learn more!

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com [https://robopapers.substack.com?utm_medium=podcast&utm_campaign=CTA_1]

All episodes

83 episodes

Ep#83: PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation

Spatial understanding is important for moving around in complex environments and is a huge part of the challenge of generalizing to new scenes. Most world models, however, largely ignore this spatial dimension, focusing on 2D images. Not PointWorld, though. PointWorld is a 3D world model trained on real and simulated data which can perform a wide variety of manipulation tasks on a real robot, including grasping or handling articulated objects, all without any additional fine-tuning. Wenlong Huang joins us to tell us more about what makes this work and how it’s different from other world models.

Watch Episode #83 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

Abstract

Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild.

References

Project page: https://point-world.github.io/
arXiv: https://arxiv.org/abs/2601.03782
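The PointWorld abstract mentions plugging a fast world model into model-predictive control. As a rough illustration of that idea only, here is a minimal random-shooting MPC sketch. Everything in it is an assumption for illustration: `toy_point_flow_model` is a hypothetical stand-in that moves every point rigidly with the summed action, and the planner, cost, and parameters are not taken from the paper or its code.

```python
import random


def toy_point_flow_model(points, actions):
    """Hypothetical stand-in for a learned 3D world model (not PointWorld).

    points:  list of (x, y, z) scene points
    actions: list of (dx, dy, dz) commands
    Returns predicted final points; here each point just shifts by the
    sum of all actions.
    """
    sx = sum(a[0] for a in actions)
    sy = sum(a[1] for a in actions)
    sz = sum(a[2] for a in actions)
    return [(x + sx, y + sy, z + sz) for (x, y, z) in points]


def cost(points, goal_points):
    """Sum of squared distances between predicted and goal points."""
    return sum(
        (p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2 + (p[2] - g[2]) ** 2
        for p, g in zip(points, goal_points)
    )


def plan_random_shooting(points, goal_points, horizon=5, num_samples=256,
                         max_step=0.05, seed=0):
    """Sample action sequences, score each with the model, keep the best."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_samples):
        actions = [
            tuple(rng.uniform(-max_step, max_step) for _ in range(3))
            for _ in range(horizon)
        ]
        predicted = toy_point_flow_model(points, actions)
        c = cost(predicted, goal_points)
        if c < best_cost:
            best_actions, best_cost = actions, c
    return best_actions, best_cost


points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
goal = [(0.1, 0.05, 0.0), (0.2, 0.05, 0.0)]
initial_cost = cost(points, goal)
actions, final_cost = plan_random_shooting(points, goal)
```

In a closed-loop MPC setting, a controller would typically execute only the first action of the best sequence, observe the scene again, and replan, which is why fast model inference matters.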

May 29, 2026 · 1 h 22 min
Ep#82: SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

Humans use tools to perform almost all of the physical work that we do from day to day. However, tools come in many different sizes and shapes, and it’s very difficult to collect human data for them in general. What about training in simulation? SimToolReal aims to address this through, unsurprisingly, sim-to-real learning. Kushal Kedia [https://x.com/kushalk_] and Tyler Lum [https://x.com/tylerlum23] talk about how it works: they procedurally generate tool-like objects, and then train with the universal objective of moving objects around to different locations. This creates a general-purpose model which can manipulate various tools to perform a variety of tasks in the real world.

Watch Episode #82 of RoboPapers, hosted by Michael Cho and Jiafei Duan, now to learn more!

Abstract

The ability to manipulate tools significantly expands the set of tasks a robot can perform. Yet, tool manipulation represents a challenging class of dexterity, requiring grasping thin objects, in-hand object rotations, and forceful interactions. Since collecting teleoperation data for these behaviors is challenging, sim-to-real reinforcement learning (RL) is a promising alternative. However, prior approaches typically require substantial engineering effort to model objects and tune reward functions for each task. In this work, we propose SimToolReal, taking a step towards generalizing sim-to-real RL policies for tool manipulation. Instead of focusing on a single object and task, we procedurally generate a large variety of tool-like object primitives in simulation and train a single RL policy with the universal goal of manipulating each object to random goal poses. This approach enables SimToolReal to perform general dexterous tool manipulation at test-time without any object or task-specific training. We demonstrate that SimToolReal outperforms prior retargeting and fixed-grasp methods by 37% while matching the performance of specialist RL policies trained on specific target objects and tasks. Finally, we show that SimToolReal generalizes across a diverse set of everyday tools, achieving strong zero-shot performance over 120 real-world rollouts spanning 24 tasks, 12 object instances, and 6 tool categories.

Learn More

Project page: https://simtoolreal.github.io/
arXiv: https://arxiv.org/abs/2602.16863

May 27, 2026 · 54 min
Ep#81: mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

Robotics fundamentally involves understanding the dynamics of how things change in the world in response to action and force. This is impossible to learn from static images; instead, it’s far more effective and more data-efficient to learn from video. Elvis Nava joins us to talk about mimic-video and Mimic Robotics. mimic-video is part of a new class of video-action models, capable of achieving complex, dexterous bimanual robotic manipulation with relatively little robot data. One of the key findings from mimic-video is that pretraining on web-scale video allows robots to learn physics priors; as a result, policies train faster, generalize better, and are capable of more impressive dexterity than policies trained on static images or image-language pairs, as with a VLM.

Watch Episode #81 of RoboPapers with Michael Cho and Chris Paxton to learn more!

Abstract

Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.

Learn More

Project page: https://mimic-video.github.io/
arXiv: https://arxiv.org/abs/2512.15692
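The mimic-video abstract describes a flow matching-based action decoder. To give a feel for how sampling from such a decoder works in general, here is a minimal sketch that integrates a velocity field from noise to an action with Euler steps. It is an illustration under stated assumptions, not the paper's implementation: `oracle_velocity` is a hypothetical closed-form stand-in for a trained network (which in the real system would be conditioned on video-model latents), and `target_action` is made-up data.

```python
import random


def oracle_velocity(x, t, target_action):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * noise + t * target, the
    conditional velocity at x_t is (target - x_t) / (1 - t). A trained
    decoder would predict this from (x_t, t, conditioning) instead.
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target_action)]


def sample_action(target_action, num_steps=10, seed=0):
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target_action]  # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # t stays strictly below 1, so no division by zero
        v = oracle_velocity(x, t, target_action)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target_action = [0.2, -0.1, 0.05]  # made-up 3-D action for illustration
decoded = sample_action(target_action)
```

With the oracle field the integration lands on the target action; with a learned field the result is a sample from the model's action distribution given its conditioning.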

May 20, 2026 · 1 h 8 min
Ep#80: LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data

Sports like tennis are great examples of the sort of dynamic whole-body interaction that’s possible with humanoid robots. But capturing examples of fast, dynamic interactions from humans is really difficult. Enter LATENT, which uses lower-quality human data plus reinforcement learning to teach a robot to play tennis, able to complete back-and-forth volleys at a human level. LATENT has three steps: (1) collecting imperfect human data like a backswing, (2) using this data to learn a latent action space, and (3) training a high-level policy in simulation which can compose these actions and execute tennis skills on a robot. Haofei Lu [https://x.com/josh00_lu] and Yunrui Lian [https://x.com/LianYunrui] join us to tell us about their method.

Watch Episode #80 of RoboPapers, with Chris Paxton and Jiafei Duan, now to learn more!

Abstract

Human athletes demonstrate versatile and highly-dynamic tennis skills to successfully conduct competitive rallies with a high-speed tennis ball. However, reproducing such behaviors on humanoid robots is difficult, partially due to the lack of perfect humanoid action data or human kinematic motion data in tennis scenarios as reference. In this work, we propose LATENT, a system that Learns Athletic humanoid TEnnis skills from imperfect human motioN daTa. The imperfect human motion data consist only of motion fragments that capture the primitive skills used when playing tennis rather than precise and complete human-tennis motion sequences from real-world tennis matches, thereby significantly reducing the difficulty of data collection. Our key insight is that, despite being imperfect, such quasi-realistic data still provide priors about human primitive skills in tennis scenarios. With further correction and composition, we learn a humanoid policy that can consistently strike incoming balls under a wide range of conditions and return them to target locations, while preserving natural motion styles. We also propose a series of designs for robust sim-to-real transfer and deploy our policy on the Unitree G1 humanoid robot. Our method achieves surprising results in the real world and can stably sustain multi-shot rallies with human players.

Learn More

Project page: https://zzk273.github.io/LATENT/
arXiv: https://arxiv.org/pdf/2603.12686
Code: https://github.com/GalaxyGeneralRobotics/LATENT

May 14, 2026 · 55 min
Ep#79: Rhoda AI - Causal Video Models Are Data-Efficient Robot Policy Learners

Training robot foundation models faces two key hurdles: how to get enough data to train an effective model, and how to make sure that new skills can be acquired quickly. The team at Rhoda AI believes that the answer is training Direct Video Action models from web data. Web data is plentiful, to the point where Rhoda can train their base model on hundreds of years of video data. And then, with the addition of robot data, they can quickly adapt it to new tasks with as little as 20 hours of in-domain data, performing complex, multi-step manipulation tasks with their purpose-built video foundation model. Tongzhou Mu [https://x.com/tongzhou_mu], Eric Chan [https://x.com/ericryanchan], and Changan Chen [https://x.com/changanvr] joined us to talk more about their approach.

Watch Episode #79 of RoboPapers, with Michael Cho, Chris Paxton, and Jiafei Duan, to learn more!

Learn More

Blog post: https://www.rhoda.ai/research/direct-video-action

May 6, 2026 · 1 h 9 min