Robotic Transformer 2 (RT-2) is a new vision-language-action (VLA) model that learns from web and robotics data and translates that knowledge into generalized instructions for robotic control.
High-capacity vision-language models (VLMs) are trained on web-scale datasets, making these systems remarkably good at recognizing visual or language patterns and operating across different languages. But for robots to reach a similar level of competency, they would need to collect robot data first-hand, across every object, environment, task, and situation.
In our paper, we introduce Robotic Transformer 2 (RT-2), a new vision-language-action (VLA) model that learns from both web and robotics data and translates this knowledge into generalized instructions for robotic control, while retaining web-scale capabilities.
This work builds on Robotic Transformer 1 (RT-1), a model trained on multi-task demonstrations that can learn combinations of tasks and objects seen in the robotic data. Specifically, our work used RT-1 robot demonstration data collected with 13 robots over 17 months in an office kitchen environment.
RT-2 shows improved generalization capabilities and semantic and visual understanding beyond the robotic data it was exposed to. This includes interpreting new commands and responding to user instructions by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.
We also show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, such as deciding which object could be used as an improvised hammer (a rock), or which type of drink is best suited for a tired person (an energy drink).
Adaptation of VLMs for robotic control
RT-2 builds upon VLMs that take one or more images as input and produce a sequence of tokens that conventionally represent natural language text. Such VLMs have been successfully trained on web-scale data to perform tasks like visual question answering, image captioning, or object recognition. In our work, we adapt the Pathways Language and Image model (PaLI-X) and the Pathways Language model Embodied (PaLM-E) to act as the backbones of RT-2.
To control a robot, it must be trained to output actions. We address this challenge by representing actions as tokens in the model's output, similar to language tokens, and describe actions as strings that can be processed by standard natural language tokenizers, shown here:
The string begins with a flag that indicates whether to continue or terminate the current episode, without executing the subsequent commands, and continues with the commands to change the position and rotation of the end-effector, as well as the desired extension of the robot gripper.
We use the same discretized version of robot actions as in RT-1, and show that converting it to a string representation makes it possible to train VLM models on robotic data – as the input and output spaces of such models do not need to be changed.
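To make this idea concrete, here is a minimal sketch of discretizing a continuous robot action and serializing it as a plain-text token string. The bin count, value ranges, and helper names below are illustrative assumptions, not the exact RT-2 implementation:

```python
import numpy as np

# Illustrative only: discretize each continuous action dimension into an
# integer bin, then join the bins into a single text string that a standard
# language tokenizer can consume. NUM_BINS and the ranges are assumptions.
NUM_BINS = 256

def discretize(value, low, high, num_bins=NUM_BINS):
    """Map a continuous value in [low, high] to an integer bin index."""
    value = np.clip(value, low, high)
    return int(round((value - low) / (high - low) * (num_bins - 1)))

def action_to_string(terminate, delta_xyz, delta_rpy, gripper):
    """Encode one action as a flat string of integer tokens:
    [terminate flag, dx, dy, dz, droll, dpitch, dyaw, gripper extension]."""
    tokens = [int(terminate)]
    tokens += [discretize(v, -0.1, 0.1) for v in delta_xyz]   # end-effector translation
    tokens += [discretize(v, -0.5, 0.5) for v in delta_rpy]   # end-effector rotation
    tokens.append(discretize(gripper, 0.0, 1.0))              # gripper extension
    return " ".join(str(t) for t in tokens)

# Prints a space-separated string of 8 integer tokens for one action.
print(action_to_string(0, [0.02, -0.01, 0.03], [0.0, 0.1, -0.2], 0.8))
```

Because the result is just text, it can be used as a training target for the VLM without modifying its output vocabulary or architecture.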
Generalization and emergent skills
We performed a series of qualitative and quantitative experiments on our RT-2 models, across more than 6,000 robotic trials. Exploring RT-2's emergent capabilities, we first searched for tasks that would require combining knowledge from web-scale data and the robot's experience, and then defined three categories of skills: symbol understanding, reasoning, and human recognition.
Each task required understanding visual-semantic concepts and the ability to perform robotic control operating on these concepts. Commands such as "pick up the bag about to fall off the table" or "move the banana to the sum of two plus one" – where the robot is asked to perform a manipulation task on objects or scenarios never seen in the robotic data – require knowledge translated from web data in order to operate.
Across all categories, we observed increased generalization performance (more than a 3x improvement) compared with previous baselines, such as previous RT-1 models and models like Visual Cortex (VC-1), which were pre-trained on large visual datasets.
We also performed a series of quantitative evaluations, beginning with the original RT-1 tasks, for which we have examples in the robot data, and continuing with increasingly difficult settings of previously unseen objects, backgrounds, and environments that required the robot to learn generalization from VLM pre-training.
RT-2 retained the performance on the original tasks seen in the robot data and improved performance on previously unseen scenarios, from RT-1's 32% to 62%, showing the considerable benefit of large-scale pre-training.
Additionally, we observed significant improvements over baselines pre-trained on visual-only tasks, such as VC-1 and Reusable Representations for Robotic Manipulation (R3M), and over algorithms that use VLMs for object identification, such as Manipulation of Open-World Objects (MOO).
Evaluating our model on the open-source Language Table suite of robotic tasks, we achieved a 90% success rate in simulation, substantially improving over previous baselines, including BC-Z (72%), RT-1 (74%), and LAVA (77%).
We then evaluated the same model in the real world (since it was trained on simulation and real data) and demonstrated its ability to generalize to novel objects, as shown below, where none of the objects except the blue cube were present in the training dataset.
Inspired by chain-of-thought prompting methods used in LLMs, we probed our models to combine robotic control with chain-of-thought reasoning to enable learning long-horizon planning and low-level skills within a single model.
In particular, we fine-tuned a variant of RT-2 for just a few hundred gradient steps to increase its ability to use language and actions jointly. Then we augmented the data to include an additional "Plan" step, which first describes the purpose of the action the robot is about to take in natural language, followed by "Action" and the action tokens. Here we show an example of such reasoning and the robot's resulting behaviour:
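As a rough illustration of what such an augmented training target could look like, here is a short sketch; the field labels, instruction, and action tokens are hypothetical and not taken from the actual RT-2 training set:

```python
# Illustrative sketch of the chain-of-thought data augmentation described
# above: the text target is prefixed with a natural-language "Plan" before
# the action tokens. The formatting and values are assumptions.

def build_cot_target(plan: str, action_tokens: str) -> str:
    """Compose the text target for one augmented training example."""
    return f"Plan: {plan}. Action: {action_tokens}"

instruction = "I am hungry, bring me a snack"          # hypothetical user instruction
target = build_cot_target(
    plan="pick up the bag of chips",                   # hypothetical plan step
    action_tokens="1 128 91 241 5 101 127 217",        # hypothetical action tokens
)
print(instruction)
print(target)
# Plan: pick up the bag of chips. Action: 1 128 91 241 5 101 127 217
```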
Through this process, RT-2 can execute more complex commands that require reasoning about the intermediate steps needed to accomplish a user instruction. Thanks to its VLM backbone, RT-2 can also plan from both image and text commands, enabling visually grounded planning, whereas current learn-to-plan approaches like SayCan cannot see the real world and rely entirely on language.
Advancing robotic control
RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data.
With two instantiations of VLAs based on PaLM-E and PaLI-X, RT-2 results in highly improved robotic policies, and, more importantly, leads to significantly better generalization performance and emergent capabilities inherited from web-scale vision-language pre-training.
RT-2 is not only a simple and effective modification over existing VLM models, but also shows the promise of building a general-purpose physical robot that can reason, problem solve, and interpret information for performing a diverse range of tasks in the real world.