To teach an AI agent a new task, such as opening a kitchen cabinet, researchers often use reinforcement learning – a process of trial and error in which the agent is rewarded for taking actions that bring it closer to its goal.
In many cases, a human expert must carefully design a reward function, which is an incentive mechanism that motivates the agent to explore. The human expert must update this reward function iteratively as the agent explores and tries different actions. This can be time-consuming, inefficient, and difficult to scale, especially when the task is complex and involves many steps.
Researchers from MIT, Harvard University and the University of Washington have developed a new approach to reinforcement learning that does not rely on an expert-designed reward function. Instead, it leverages crowdsourced feedback, collected from many non-expert users, to guide the agent in learning to achieve its goal.
While other methods also attempt to use feedback from non-experts, this new approach allows the AI agent to learn faster, despite the fact that the data collected from users is often full of errors. This noisy data can cause other methods to fail.
Additionally, this new approach allows feedback to be collected asynchronously, so that non-expert users around the world can contribute to agent training.
“Today, one of the longest and most difficult steps in designing a robotic agent is designing the reward function. Today, reward functions are designed by expert researchers – a paradigm that is not scalable if we want to teach many different tasks to our robots. Our work proposes a way to scale robot learning by crowdsourcing the design of the reward function and allowing non-experts to provide useful feedback,” says Pulkit Agrawal, assistant professor in the Department of Electrical Engineering and Computer Science (EECS), who runs the Improbable AI Lab at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).
In the future, this method could help a robot quickly learn to perform specific tasks in a user’s home, without the owner needing to show the robot physical examples of each task. The robot could explore on its own, with crowdsourced non-expert feedback guiding its exploration.
“In our method, the reward function guides the agent towards what it should explore, instead of telling it exactly what it should do to accomplish the task. So even though the human supervision is somewhat imprecise and noisy, the agent is still able to explore, which helps it learn better,” says lead author Marcel Torne ’23, a research assistant at the Improbable AI Lab.
Torne is joined on the paper by his MIT advisor, Agrawal; senior author Abhishek Gupta, assistant professor at the University of Washington; as well as others at the University of Washington and MIT. The research will be presented at the Conference on Neural Information Processing Systems next month.
One way to collect user feedback for reinforcement learning is to show a user two photos of states the agent has reached and ask which one is closer to the goal. For example, perhaps a robot’s goal is to open a kitchen cabinet. One image might show the robot opening the cabinet, while the second might show it opening the microwave. A user would pick the photo showing the “better” state.
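As a toy illustration of this kind of binary query (the data structure, function names, and annotator below are invented for this sketch, not taken from the paper), a single comparison label might be collected like this:

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    """One crowdsourced label: which of two reached states looks closer to the goal."""
    state_a: tuple
    state_b: tuple
    preferred: int  # 0 if state_a looks closer to the goal, 1 if state_b

def collect_label(state_a, state_b, annotator):
    """Ask a (possibly non-expert, possibly error-prone) annotator to compare two states."""
    preferred = annotator(state_a, state_b)  # returns 0 or 1
    return Comparison(state_a, state_b, preferred)

# Toy annotator: judges squared distance to a known goal position, with no noise.
goal = (1.0, 1.0)

def toy_annotator(a, b):
    dist = lambda s: (s[0] - goal[0]) ** 2 + (s[1] - goal[1]) ** 2
    return 0 if dist(a) <= dist(b) else 1

label = collect_label((0.9, 0.8), (0.1, 0.2), toy_annotator)
print(label.preferred)  # → 0: the first state is closer to the goal
```

Real crowdsourced annotators would of course be inconsistent, which is exactly the noise the method is designed to tolerate.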
Some previous approaches attempt to use this binary, crowdsourced feedback to optimize a reward function that the agent then uses to learn the task. However, because non-experts are likely to make errors, the reward function can become very noisy, so the agent may get stuck and never achieve its goal.
“Basically, the agent would take the reward function too seriously. It would try to match the reward function perfectly. So instead of directly optimizing the reward function, we simply use it to tell the robot which areas it should explore,” explains Torne.
He and his collaborators decoupled the process into two distinct parts, each driven by its own algorithm. They call their new reinforcement learning method HuGE (Human Guided Exploration).
On one hand, a goal selection algorithm is continually updated with crowdsourced human feedback. The feedback is not used as a reward function, but rather to guide the agent’s exploration. In a sense, non-expert users leave a trail of breadcrumbs that gradually leads the agent toward its objective.
On the other hand, the agent explores on its own, in a self-supervised manner guided by the goal selector. It collects images or videos of the actions it attempts, which are then sent to humans and used to update the goal selector.
This reduces the area to be explored by the agent, leading it to more promising areas closer to its objective. But if there is no feedback, or if the feedback takes time to arrive, the agent will continue to learn on its own, albeit more slowly. This allows feedback to be collected infrequently and asynchronously.
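A minimal sketch of that decoupling, with invented names and a toy grid world standing in for the real robot and learned models: a goal selector is updated from binary comparisons whenever feedback happens to arrive, while the exploration loop keeps stepping toward the currently best-scored state regardless.

```python
class GoalSelector:
    """Scores previously visited states; a higher score means judged closer to the goal."""
    def __init__(self):
        self.scores = {}  # state -> score accumulated from binary comparisons

    def update(self, comparisons):
        # Simple ranking update: the preferred state gains score, the other loses.
        for a, b, preferred in comparisons:
            winner, loser = (a, b) if preferred == 0 else (b, a)
            self.scores[winner] = self.scores.get(winner, 0.0) + 1.0
            self.scores[loser] = self.scores.get(loser, 0.0) - 1.0

    def pick_goal(self, visited):
        # Direct exploration toward the most promising visited state so far.
        return max(visited, key=lambda s: self.scores.get(s, 0.0))

def explore_step(state, goal):
    """Self-supervised step: move one unit toward the selected goal on a grid."""
    x, y = state
    gx, gy = goal
    x += (gx > x) - (gx < x)
    y += (gy > y) - (gy < y)
    return (x, y)

selector = GoalSelector()
state = (0, 0)
visited = {state}

for step in range(20):
    # Feedback arrives asynchronously; here, only on some iterations.
    if step % 5 == 0:
        # One crowdsourced comparison: (4, 4) judged closer to the goal than (0, 0).
        selector.update([((4, 4), (0, 0), 0)])
        visited.add((4, 4))
    goal = selector.pick_goal(visited)
    state = explore_step(state, goal)
    visited.add(state)

print(state)  # → (4, 4): exploration was steered toward the highest-scored state
```

If the feedback stopped arriving, the loop above would still run; the agent would simply keep exploring under its current, possibly stale, goal selector, which mirrors the asynchronous behavior described above.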
“The exploration loop can continue on its own, because it will simply explore and learn new things. And then when you get a better signal, it will explore in a more focused way. You can just let it run at its own pace,” adds Torne.
And because feedback only gently guides the agent’s behavior, the agent will eventually learn to complete the task even if users provide incorrect answers.
The researchers tested this method on a number of simulated and real-world tasks. In simulation, they used HuGE to efficiently learn tasks with long sequences of actions, such as stacking blocks in a particular order or navigating a large maze.
In real-world testing, they used HuGE to train robotic arms to draw the letter “U” and to pick and place objects. For these tests, they collected data from 109 non-expert users in 13 different countries spanning three continents.
In real and simulated experiments, HuGE helped agents learn to achieve their goal faster than other methods.
The researchers also found that data from non-experts produced better performance than synthetic data, produced and labeled by the researchers. For non-expert users, labeling 30 images or videos took less than two minutes.
“This makes the method very promising to scale up,” adds Torne.
In a related paper, which the researchers presented at the recent Conference on Robot Learning, they improved HuGE so that an AI agent can learn to perform a task and then autonomously reset the environment to continue learning. For example, if the agent learns to open a cabinet, the method also guides it in closing the cabinet.
“We can now have it learn completely autonomously without the need for human resets,” he says.
The researchers also point out that, in this as in other learning approaches, it is essential to ensure that AI agents are aligned with human values.
In the future, they want to keep refining HuGE so that the agent can learn from other forms of communication, such as natural language and physical interactions with the robot. They also want to apply the method to training several agents at once.
This research is funded, in part, by the MIT-IBM Watson AI Lab.