Robotic Transformer 2 (RT-2) is a new vision-language-action (VLA) model that learns from web and robotics data and translates that knowledge into generalized instructions for robotic control.
High-capacity vision-language models (VLMs) are trained on web-scale datasets, making these systems remarkably good at recognizing visual or language patterns and operating across different languages. But for robots to reach a similar level of competency, they would need to collect robot data first-hand, across every object, environment, task, and situation.
In our paper, we introduce Robotic Transformer 2 (RT-2), a new vision-language-action (VLA) model that learns from both web and robotics data and translates this knowledge into generalized instructions for robotic control, while retaining web-scale capabilities.
This work builds on Robotic Transformer 1 (RT-1), a model trained on multi-task demonstrations that can learn combinations of tasks and objects seen in the robotic data. Specifically, our work used RT-1 robot demonstration data collected with 13 robots over 17 months in an office kitchen environment.
RT-2 shows improved generalization capabilities and semantic and visual understanding beyond the robotic data it was exposed to. This includes interpreting new commands and responding to user instructions by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.
We also show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, such as deciding which object could be used as an improvised hammer (a rock), or which type of drink is best suited for a tired person (an energy drink).
Adaptation of VLMs for robotic control
RT-2 builds upon VLMs that take one or more images as input and produce a sequence of tokens that conventionally represent natural language text. Such VLMs have been successfully trained on web-scale data to perform tasks like visual question answering, image captioning, or object recognition. In our work, we adapt the Pathways Language and Image model (PaLI-X) and the Pathways Language model Embodied (PaLM-E) to act as the backbones of RT-2.
To control a robot, it must be trained to output actions. We address this challenge by representing actions as tokens in the model's output, similar to language tokens, and describe actions as strings that can be processed by standard natural language tokenizers, shown here:
The string begins with a flag that indicates whether to continue or terminate the current episode, without executing the subsequent commands, and continues with the commands to change the position and rotation of the end-effector, as well as the desired extension of the robot gripper.
We use the same discretized version of robot actions as in RT-1, and show that converting it to a string representation makes it possible to train VLM models on robotic data – as the input and output spaces of such models do not need to be changed.
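To make this idea concrete, here is a minimal sketch of discretizing a continuous robot action and serializing it as a plain-text token string. The bin count, value ranges, and helper names below are illustrative assumptions, not the exact RT-2 implementation:

```python
import numpy as np

# Illustrative only: discretize each continuous action dimension into an
# integer bin, then join the bins into a single text string that a standard
# language tokenizer can consume. NUM_BINS and the ranges are assumptions.
NUM_BINS = 256

def discretize(value, low, high, num_bins=NUM_BINS):
    """Map a continuous value in [low, high] to an integer bin index."""
    value = np.clip(value, low, high)
    return int(round((value - low) / (high - low) * (num_bins - 1)))

def action_to_string(terminate, delta_xyz, delta_rpy, gripper):
    """Encode one action as a flat string of integer tokens:
    [terminate flag, dx, dy, dz, droll, dpitch, dyaw, gripper extension]."""
    tokens = [int(terminate)]
    tokens += [discretize(v, -0.1, 0.1) for v in delta_xyz]   # end-effector translation
    tokens += [discretize(v, -0.5, 0.5) for v in delta_rpy]   # end-effector rotation
    tokens.append(discretize(gripper, 0.0, 1.0))              # gripper extension
    return " ".join(str(t) for t in tokens)

# Prints a space-separated string of 8 integer tokens for one action.
print(action_to_string(0, [0.02, -0.01, 0.03], [0.0, 0.1, -0.2], 0.8))
```

Because the result is just text, it can be used as a training target for the VLM without modifying its output vocabulary or architecture.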
Generalization and emergent skills
We performed a series of qualitative and quantitative experiments on our RT-2 models, across more than 6,000 robotic trials. Exploring RT-2's emergent capabilities, we first searched for tasks that would require combining knowledge from web-scale data and the robot's experience, and then defined three categories of skills: symbol understanding, reasoning, and human recognition.
Each task required understanding visual-semantic concepts and the ability to perform robotic control operating on these concepts. Commands such as "pick up the bag about to fall off the table" or "move the banana to the sum of two plus one" – where the robot is asked to perform a manipulation task on objects or scenarios never seen in the robotic data – require knowledge translated from web data in order to operate.
Across all categories, we observed increased generalization performance (more than a 3x improvement) compared with previous baselines, such as previous RT-1 models and models like Visual Cortex (VC-1), which were pre-trained on large visual datasets.
We also performed a series of quantitative evaluations, beginning with the original RT-1 tasks, for which we have examples in the robot data, and continuing with increasingly difficult settings of previously unseen objects, backgrounds, and environments that required the robot to learn generalization from VLM pre-training.
RT-2 retained the performance on the original tasks seen in the robot data and improved performance on previously unseen scenarios, from RT-1's 32% to 62%, showing the considerable benefit of large-scale pre-training.
Additionally, we observed significant improvements over baselines pre-trained on visual-only tasks, such as VC-1 and Reusable Representations for Robotic Manipulation (R3M), and over algorithms that use VLMs for object identification, such as Manipulation of Open-World Objects (MOO).
Evaluating our model on the open-source Language Table suite of robotic tasks, we achieved a 90% success rate in simulation, substantially improving over previous baselines, including BC-Z (72%), RT-1 (74%), and LAVA (77%).
We then evaluated the same model in the real world (since it was trained on simulation and real data) and demonstrated its ability to generalize to novel objects, as shown below, where none of the objects except the blue cube were present in the training dataset.
Inspired by chain-of-thought prompting methods used in LLMs, we probed our models to combine robotic control with chain-of-thought reasoning to enable learning long-horizon planning and low-level skills within a single model.
In particular, we fine-tuned a variant of RT-2 for just a few hundred gradient steps to increase its ability to use language and actions jointly. Then we augmented the data to include an additional "Plan" step, which first describes the purpose of the action the robot is about to take in natural language, followed by "Action" and the action tokens. Here we show an example of such reasoning and the robot's resulting behaviour:
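As a rough illustration of what such an augmented training target could look like, here is a short sketch; the field labels, instruction, and action tokens are hypothetical and not taken from the actual RT-2 training set:

```python
# Illustrative sketch of the chain-of-thought data augmentation described
# above: the text target is prefixed with a natural-language "Plan" before
# the action tokens. The formatting and values are assumptions.

def build_cot_target(plan: str, action_tokens: str) -> str:
    """Compose the text target for one augmented training example."""
    return f"Plan: {plan}. Action: {action_tokens}"

instruction = "I am hungry, bring me a snack"          # hypothetical user instruction
target = build_cot_target(
    plan="pick up the bag of chips",                   # hypothetical plan step
    action_tokens="1 128 91 241 5 101 127 217",        # hypothetical action tokens
)
print(instruction)
print(target)
# Plan: pick up the bag of chips. Action: 1 128 91 241 5 101 127 217
```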
Through this process, RT-2 can execute more complex commands that require reasoning about the intermediate steps needed to accomplish a user instruction. Thanks to its VLM backbone, RT-2 can also plan from both image and text commands, enabling visually grounded planning, whereas current learn-to-plan approaches like SayCan cannot see the real world and rely entirely on language.
Advancing robotic control
RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data.
With two instantiations of VLAs based on PaLM-E and PaLI-X, RT-2 results in highly improved robotic policies, and, more importantly, leads to significantly better generalization performance and emergent capabilities inherited from web-scale vision-language pre-training.
RT-2 is not only a simple and effective modification over existing VLM models, but also shows the promise of building a general-purpose physical robot that can reason, problem solve, and interpret information for performing a diverse range of tasks in the real world.