
RoboPapers

Podcast by Chris Paxton and Michael Cho

English

Technology and science


About RoboPapers

Chris Paxton & Michael Cho geek out over robotic papers with paper authors. robopapers.substack.com

All episodes

81 episodes


Ep#81: mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

Robotics fundamentally involves understanding the dynamics of how things change in the world in response to action and force. This is impossible to learn from static images; instead, it’s far more effective and more data-efficient to learn from video.

Elvis Nava joins us to talk about mimic-video and Mimic Robotics. Mimic-video is part of a new class of video-action models, capable of achieving complex, dexterous bimanual robotic manipulation with relatively little robot data. One of the key findings from mimic-video is that pretraining on web-scale video allows robots to learn physics priors; as a result, policies train faster, generalize better, and are capable of more impressive dexterity than policies trained on static images or image-language pairs, as in a VLM.

Watch Episode #81 of RoboPapers with Michael Cho and Chris Paxton to learn more!

Abstract

Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.

Learn More

Project page: https://mimic-video.github.io/
arXiv: https://arxiv.org/abs/2512.15692

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com [https://robopapers.substack.com?utm_medium=podcast&utm_campaign=CTA_1]
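
The flow matching-based action decoder mentioned in the mimic-video abstract can be illustrated with a toy sampler. The sketch below is not the authors' implementation. It assumes a straight-line probability path, and `toy_velocity` is a hypothetical closed-form stand-in for the learned network, which in the paper is conditioned on the video model's latent representations. It only shows the sampling mechanics: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 with Euler steps.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical velocity field (an assumption, not from the paper).

    For a straight-line path toward a single target action, the velocity
    at point x and time t is (target - x) / (1 - t). A real decoder would
    replace this with a neural network conditioned on video latents.
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]


def sample_action(target, steps=10, seed=0):
    """Draw Gaussian noise and transport it to an action with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # initial noise sample at t = 0
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Toy 3-dimensional "action" used purely for illustration.
    print(sample_action([0.5, -0.2, 0.1]))
```

With this particular velocity field the final Euler step lands on the target, so the printed values are approximately equal to the target action. A learned field would be trained from demonstrations instead of being written in closed form.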

May 20, 2026 - 1 h 8 min

Ep#80: LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data

Sports like tennis are great examples of the sort of dynamic whole-body interaction that’s possible with humanoid robots. But capturing examples of fast, dynamic interactions from humans is really difficult. Enter LATENT, which uses lower-quality human data plus reinforcement learning to teach a robot to play tennis and complete back-and-forth volleys at a human level.

LATENT has three steps: (1) collecting imperfect human data like a backswing, (2) using these data to learn a latent action space, and (3) training a high-level policy in simulation that can compose these actions and execute tennis skills on a robot.

Haofei Lu [https://x.com/josh00_lu] and Yunrui Lian [https://x.com/LianYunrui] join us to tell us about their method.

Watch Episode #80 of RoboPapers, with Chris Paxton and Jiafei Duan, now to learn more!

Abstract

Human athletes demonstrate versatile and highly-dynamic tennis skills to successfully conduct competitive rallies with a high-speed tennis ball. However, reproducing such behaviors on humanoid robots is difficult, partially due to the lack of perfect humanoid action data or human kinematic motion data in tennis scenarios as reference. In this work, we propose LATENT, a system that Learns Athletic humanoid TEnnis skills from imperfect human motioN daTa. The imperfect human motion data consist only of motion fragments that capture the primitive skills used when playing tennis rather than precise and complete human-tennis motion sequences from real-world tennis matches, thereby significantly reducing the difficulty of data collection. Our key insight is that, despite being imperfect, such quasi-realistic data still provide priors about human primitive skills in tennis scenarios. With further correction and composition, we learn a humanoid policy that can consistently strike incoming balls under a wide range of conditions and return them to target locations, while preserving natural motion styles. We also propose a series of designs for robust sim-to-real transfer and deploy our policy on the Unitree G1 humanoid robot. Our method achieves surprising results in the real world and can stably sustain multi-shot rallies with human players.

Learn More

Project page: https://zzk273.github.io/LATENT/
arXiv: https://arxiv.org/pdf/2603.12686
Code: https://github.com/GalaxyGeneralRobotics/LATENT

May 14, 2026 - 55 min

Ep#79: Rhoda AI - Causal Video Models Are Data-Efficient Robot Policy Learners

Training robot foundation models faces two key hurdles: how to get enough data to train an effective model, and how to make sure that new skills can be acquired quickly. The team at Rhoda AI believes that the answer is training Direct Video Action models from web data. Web data is plentiful, to the point where Rhoda can train their base model on hundreds of years of video data. Then, with the addition of robot data, they can quickly adapt it to new tasks with as little as 20 hours of in-domain data, performing complex, multi-step manipulation tasks with their purpose-built video foundation model.

Tongzhou Mu [https://x.com/tongzhou_mu], Eric Chan [https://x.com/ericryanchan], and Changan Chen [https://x.com/changanvr] joined us to talk more about their approach.

Watch Episode #79 of RoboPapers, with Michael Cho, Chris Paxton, and Jiafei Duan, to learn more!

Learn More

Blog post: https://www.rhoda.ai/research/direct-video-action

May 6, 2026 - 1 h 9 min

Ep#78: Three Eras of Robot Learning

Robotics has changed dramatically over the last eight years. Ted has been involved in the cutting edge of robot learning through this period, spending those eight years at Google Brain/Google DeepMind. And he’s identified three eras of robot learning. These eras are:

  • The Era of Existence Proofs - trying different methods like QT-Opt, on-robot RL
  • The Era of Foundation Models - transitioning to data collection and clean objectives (i.e. supervised learning)
  • The Era of Scaling - orders of magnitude more data and larger models, enabling reasoning, long-horizon actions, and cross-embodiment transfer

The only reason something succeeds is if everything goes right. Behavior cloning, for example, seemed stuck at a 60-70% success rate on key tasks until his team rewrote their learning stack, at which point it hit 95-99%+ success rates.

For most of those eight years, something was wrong. The stack wasn’t quite right, the learning algorithms were wrong, the data didn’t exist. Hardware and operations were not mature enough. But they kept working on these problems, over and over, until they finally arrived at amazing breakthroughs.

Some key trends now:

  • Reasoning models for robotics
  • Long-horizon, precision-oriented tasks, like making coffee from Physical Intelligence or GPU assembly from Skild
  • Cross-embodiment transfer
  • Hardware and model co-design
  • Results are nice, but capabilities matter even more, and academics are going to have trouble keeping up with the compute and resources available to companies

Watch Episode #78 of RoboPapers, with Michael Cho and Jiafei Duan, to learn more!

May 5, 2026 - 1 h 10 min

Ep#77: DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

World models have many different uses, from evaluation to training data generation to robot planning. DreamDojo is a new foundation world model that allows for impressively general and long-horizon interaction, generating coherent videos for interaction sequences over a minute long. It works in a wide range of environments and even generalizes to previously unseen environments.

We talked to Shenyuan Gao and William Liang about how they built DreamDojo, and about what tricks were necessary to scale world model learning on data with sparse action labels, pretraining on 44,000 hours of human data and adapting to a wide variety of robots, environments, and skills.

Watch Episode #77 of RoboPapers with Michael Cho and Chris Paxton now to learn more!

Abstract

Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.

Learn More

Project page: https://dreamdojo-world.github.io/
arXiv: https://arxiv.org/abs/2602.06949
GitHub: https://github.com/NVIDIA/DreamDojo
Original thread on X: https://x.com/ShenyuanGao/status/2024898256334114876

April 29, 2026 - 1 h 2 min