One AI Model to Rule All Robots

The software used to control a robot is normally tailored closely to its specific physical setup. But now researchers have created a single general-purpose robotic control policy that can operate robotic arms, wheeled robots, quadrupeds, and even drones.

One of the biggest challenges when it comes to applying machine learning to robotics is the paucity of data. While computer vision and natural language processing can piggyback off the vast quantities of image and text data found on the Internet, collecting robot data is costly and time-consuming.

To get around this, there have been growing efforts to pool data collected by different groups on different kinds of robots, including the Open X-Embodiment and DROID datasets. The hope is that training on diverse robotics data will lead to “positive transfer,” which refers to when skills learned from training on one task help to boost performance on another.

The problem is that robots often have very different embodiments—a term used to describe their physical layout and suite of sensors and actuators—so the data they collect can vary significantly. For instance, a robotic arm might be static, have a complex arrangement of joints and fingers, and collect video from a camera on its wrist. In contrast, a quadruped robot is regularly on the move and relies on force feedback from its legs to maneuver. The kinds of tasks and actions these machines are trained to carry out are also diverse: The arm may pick and place objects, while the quadruped needs keen navigation.

That makes training a single AI model on these large collections of data challenging, says Homer Walke, a Ph.D. student at the University of California, Berkeley. So far, most attempts have either focused on data from a narrower selection of similar robots or relied on researchers manually tweaking data to make observations from different robots more similar. But in research to be presented at the Conference on Robot Learning (CoRL) in Munich in November, Walke and his colleagues unveiled a new model called CrossFormer that can train on data from a diverse set of robots and control them just as well as specialized control policies.

“We want to be able to train on all of this data to get the most capable robot,” says Walke. “The main advance in this paper is working out what kind of architecture works the best for accommodating all these varying inputs and outputs.”

How to control diverse robots with the same AI model

The team used the same model architecture that powers large language models, known as a transformer. In many ways, the challenge the researchers were trying to solve is not dissimilar to that facing a chatbot, says Walke. In language modeling, the AI has to pick out similar patterns in sentences with different lengths and word orders. Robot data can also be arranged in a sequence much like a written sentence, but depending on the particular embodiment, observations and actions vary in length and order too.

“Words might appear in different locations in a sentence, but they still mean the same thing,” says Walke. “In our task, an observation image might appear in different locations in the sequence, but it’s still fundamentally an image and we still want to treat it like an image.”
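To make the idea concrete, here is a minimal sketch of how observations from different robots could be flattened into one kind of sequence. It is an illustration only, not CrossFormer's actual tokenizer: the `Token` class, the `tokenize_observation` function, the sensor names, and the chunk size are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Token:
    modality: str        # what kind of data this is, e.g. "image" or "proprio"
    source: str          # hypothetical sensor name the data came from
    values: List[float]  # fixed-size chunk of the raw reading


def tokenize_observation(
    obs: Dict[str, Tuple[str, List[float]]], chunk: int = 4
) -> List[Token]:
    """Flatten one robot's observation into a sequence of tagged tokens.

    Each sensor reading is split into fixed-size chunks (zero-padded at the
    end), and every chunk keeps its modality tag, so an image chunk is still
    marked as an image wherever it lands in the sequence.
    """
    tokens: List[Token] = []
    for source, (modality, values) in obs.items():
        for start in range(0, len(values), chunk):
            piece = values[start:start + chunk]
            piece = piece + [0.0] * (chunk - len(piece))  # pad last chunk
            tokens.append(Token(modality, source, piece))
    return tokens


# Two made-up embodiments with different sensors and different input sizes.
arm_obs = {
    "wrist_camera": ("image", [0.1] * 8),
    "joint_positions": ("proprio", [0.0] * 7),
}
quadruped_obs = {
    "joint_positions": ("proprio", [0.0] * 12),
}

arm_tokens = tokenize_observation(arm_obs)
quad_tokens = tokenize_observation(quadruped_obs)
print(len(arm_tokens), len(quad_tokens))
```

The two robots produce sequences of different lengths and different contents, but every token has the same shape and carries a label saying what it is, which is the property Walke describes.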


Many earlier sequence models, such as recurrent neural networks, work through a sequence one element at a time, but transformers can process the entire stream of data at once. This allows them to analyze the relationships between different elements and makes them better at handling sequences that are not standardized, much like the diverse data found in large robotics datasets.
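The mechanism behind this is known as self-attention. The sketch below is a stripped-down, single-head version written in plain Python, assuming for simplicity that queries, keys, and values are the raw input vectors (a real transformer learns separate projections for each). The function names and the toy inputs are made up for illustration.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq: List[List[float]]) -> List[List[float]]:
    """Scaled dot-product self-attention without learned projections.

    Every output vector is a weighted mix of all input vectors, so each
    element is compared against the whole sequence rather than only the
    elements that came before it.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Score this element against every element in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Mix all elements according to those weights.
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)])
    return out


# The same function handles sequences of different lengths unchanged.
short = self_attention([[1.0, 0.0], [0.0, 1.0]])
long = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
print(len(short), len(long))
```

Nothing in the function depends on how long the sequence is or on where a given element sits in it, which is why the architecture copes with inputs that differ from robot to robot.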

Walke and his colleagues aren’t the first to train transformers on large-scale robotics data. But previous approaches have either trained solely on data from robotic arms with broadly similar embodiments or manually converted input data to a common format to make it easier to process. In contrast, CrossFormer can process images from cameras positioned above a robot, at head height, or on a robotic arm’s wrist, as well as joint position data from both quadrupeds and robotic arms, without any tweaks.

The result is a single control policy that can operate single robotic arms, pairs of robotic arms, quadrupeds, and wheeled robots on tasks as varied as picking and placing objects, cutting sushi, and obstacle avoidance. Crucially, it matched the performance of specialized models tailored for each robot and outperformed previous approaches trained on diverse robotic data. The team even tested whether the model could control an embodiment not included in the dataset—a small quadcopter. While they simplified things by making the drone fly at a fixed altitude, CrossFormer still outperformed the previous best method.

“That was definitely pretty cool,” says Ria Doshi, an undergraduate student at Berkeley. “I think that as we scale up our policy to be able to train on even larger sets of diverse data, it’ll become easier to see this kind of zero-shot transfer onto robots that have been completely unseen in the training.”

The limitations of one AI model for all robots

The team admits there’s still work to do, however. The model is too big for any of the robots’ embedded chips and instead has to be run from a server. Even then, processing times are only just fast enough to support real-time operation, and Walke admits that could break down if they scale up the model. “When you pack so much data into a model it has to be very big and that means running it for real-time control becomes difficult.”

One potential workaround would be to use an approach called distillation, says Oier Mees, a postdoctoral researcher at Berkeley and part of the CrossFormer team. This essentially involves training a smaller model to mimic the larger model, and if successful can result in similar performance for a much smaller computational budget.
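As a toy illustration of the principle, the sketch below trains a tiny "student" to copy the actions of a fixed "teacher" by minimizing the squared difference between their outputs. Everything here is an assumption made for the example: the teacher is a simple stand-in function rather than a large model, the student is a two-weight linear policy, and the learning rate and step count are arbitrary. It says nothing about how the CrossFormer team would actually do it.

```python
import random
from typing import List


def teacher_policy(obs: List[float]) -> float:
    """Stand-in for a large, expensive model (hypothetical fixed mapping)."""
    return 0.8 * obs[0] - 0.3 * obs[1]


def distill(steps: int = 2000, lr: float = 0.1, seed: int = 0) -> List[float]:
    """Train a small linear student to mimic the teacher's actions.

    At each step the student sees a random observation, asks the teacher
    what it would do, and nudges its weights to shrink the squared error
    between its own action and the teacher's.
    """
    rng = random.Random(seed)
    weights = [0.0, 0.0]  # the student's only parameters
    for _ in range(steps):
        obs = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        target = teacher_policy(obs)  # teacher's action is the label
        pred = sum(w * x for w, x in zip(weights, obs))
        err = pred - target
        # Gradient of err**2 with respect to each weight is 2 * err * x.
        weights = [w - lr * 2.0 * err * x for w, x in zip(weights, obs)]
    return weights


student_weights = distill()
print(student_weights)
```

In this toy case the student can match the teacher exactly because the teacher is itself linear. With a real policy the student has less capacity than the teacher, so how much performance survives is an empirical question, which is why Mees frames distillation as a potential workaround rather than a guaranteed one.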

A bigger issue than the computing resource problem is that the team failed to see any positive transfer in their experiments, as CrossFormer simply matched previous performance rather than exceeding it. Walke thinks progress in computer vision and natural language processing suggests that training on more data could be the key.

Others say it might not be that simple. Jeannette Bohg, a professor of robotics at Stanford University, says the ability to train on such a diverse dataset is a significant contribution. But she wonders whether part of the reason why the researchers didn’t see positive transfer is their insistence on not aligning the input data. Previous research that trained on robots with similar observation and action data has shown evidence of such cross-overs. “By getting rid of this alignment, they may have also gotten rid of this significant positive transfer that we’ve seen in other work,” Bohg says.

It’s also not clear if the approach will boost performance on tasks specific to particular embodiments or robotic applications, says Ram Ramamoorthy, a robotics professor at Edinburgh University. The work is a promising step towards helping robots capture concepts common to most robots, like “avoid this obstacle,” he says. But it may be less useful for tackling control problems specific to a particular robot, such as how to knead dough or navigate a forest, which are often the hardest to solve.