In our recent paper, published in Nature Human Behaviour, we provide a proof of concept that deep reinforcement learning (RL) can be used to find economic policies that people will overwhelmingly vote for in a simple game. The work thus addresses a key challenge in AI research: how to train AI systems that align with human values.
Imagine that a group of people decide to pool funds to make an investment. The investment pays off, and there is a profit to share. How should it be distributed? A simple strategy is to split the returns equally among the investors. But that might be unfair, because some people contributed more than others. Alternatively, we could repay everyone in proportion to the size of their initial investment. That seems fair, but what if people had different levels of assets to begin with? If two people contribute the same amount, but one gives a fraction of their available funds and the other gives all of them, should they receive the same share of the profits?
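To make these options concrete, here is a minimal sketch in Python. The endowments, contributions and 50% growth rate are invented for illustration; they are not the study's parameters.

```python
endowments = [10.0, 2.0]        # player 1 is richer than player 2
contributions = [2.0, 2.0]      # both invest the same absolute amount
growth = 1.5                    # invested funds are guaranteed to grow

pool = sum(contributions) * growth   # 6.0 to redistribute

# 1. Strict egalitarian: equal shares regardless of contribution.
egalitarian = [pool / len(contributions)] * len(contributions)        # [3.0, 3.0]

# 2. Libertarian: shares proportional to absolute contribution.
libertarian = [pool * c / sum(contributions) for c in contributions]  # [3.0, 3.0]

# 3. Relative contribution: shares proportional to the fraction of
#    one's endowment that was invested (0.2 vs 1.0 here).
fractions = [c / e for c, e in zip(contributions, endowments)]
relative = [pool * f / sum(fractions) for f in fractions]             # [1.0, 5.0]
```

Under the first two mechanisms the two players are indistinguishable; only the third registers that one of them gave everything they had.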
This question of the redistribution of resources in our economies and societies has long sparked controversy among philosophers, economists and political scientists. Here, we use deep RL as a testbed to explore ways to solve this problem.
To meet this challenge, we created a simple game involving four players. Each game was played over two blocks of 10 rounds. In each round, each player was allocated funds, with the size of the endowment varying between players. Each player made a choice: they could keep these funds for themselves or invest them in a common pool. Invested funds were guaranteed to grow, but there was a risk, because players did not know how the proceeds would be shared out. Instead, they were told that for the first 10 rounds one referee (A) made the redistribution decisions, and that for the second 10 rounds a different referee (B) took over. At the end of the game, they voted for A or B and played another game with the winning referee. Human players were allowed to keep the proceeds of this final game, so they had an incentive to report their preferences accurately.
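For readers who prefer code, the structure of one game can be sketched roughly as follows, where a player is any function from an endowment to an investment and a referee is any function from (endowments, contributions, pool) to per-player payouts. All names, the endowment values and the growth factor are our own placeholders, not the experiment's parameters.

```python
import random

def play_game(referees, players, rounds_per_referee=10, growth=1.5):
    """Rough sketch of one game: each referee in `referees` (A, then B)
    controls one block of rounds. Illustrative values throughout."""
    totals = [[0.0] * len(players) for _ in referees]
    for block, referee in enumerate(referees):
        for _ in range(rounds_per_referee):
            # each player is allocated funds; endowment sizes differ
            endowments = [random.choice([2.0, 10.0]) for _ in players]
            # each player keeps their funds or invests them in the common pool
            contributions = [p(e) for p, e in zip(players, endowments)]
            pool = sum(contributions) * growth  # invested funds are guaranteed to grow
            payouts = referee(endowments, contributions, pool)
            for i in range(len(players)):
                # kept funds plus the redistributed share
                totals[block][i] += endowments[i] - contributions[i] + payouts[i]
    # players see their earnings under A and under B, vote for one,
    # and replay the game with the winning referee
    return totals
```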
In reality, one of the referees was a predefined redistribution policy, and the other was designed by our deep RL agent. To train the agent, we first recorded data from a large number of human groups and taught a neural network to copy how people played the game. This simulated population could generate unlimited data, allowing us to use data-intensive machine learning methods to train the RL agent to maximize the votes of these “virtual” players. Having done so, we recruited new human players and compared the AI-designed mechanism against well-known benchmarks, such as a libertarian policy that returns funds to people in proportion to their contributions.
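A heavily simplified sketch of that two-stage pipeline, with a hand-coded stand-in for the behaviour-cloned player model and a grid search standing in for the deep RL optimizer; every name and number here is illustrative, not the paper's code.

```python
import random

def virtual_player(endowment):
    """Stand-in for the behaviour-cloned player model: in this toy,
    poorer players invest a larger fraction of their endowment."""
    frac = random.uniform(0.6, 1.0) if endowment < 5 else random.uniform(0.2, 0.6)
    return frac * endowment

def payouts(w, endowments, contributions, pool):
    """A one-parameter family of mechanisms: w=0 pays out by absolute
    contribution (libertarian), w=1 by relative contribution."""
    absolute = [c / sum(contributions) for c in contributions]
    fractions = [c / e for c, e in zip(contributions, endowments)]
    relative = [f / sum(fractions) for f in fractions]
    return [pool * ((1 - w) * a + w * r) for a, r in zip(absolute, relative)]

def simulated_vote_share(w, n_games=500):
    """Fraction of virtual players who earn more under mechanism w than
    under the libertarian baseline; a crude stand-in for their vote."""
    votes, ballots = 0, 0
    for _ in range(n_games):
        endowments = [2.0, 2.0, 2.0, 10.0]           # unequal starting funds
        contributions = [virtual_player(e) for e in endowments]
        pool = sum(contributions) * 1.5              # invested funds grow
        ours = payouts(w, endowments, contributions, pool)
        base = payouts(0.0, endowments, contributions, pool)
        votes += sum(o > b for o, b in zip(ours, base))
        ballots += len(endowments)
    return votes / ballots

# Pick the mechanism that wins the most simulated votes
# (grid search here; the actual study uses deep RL for this step).
best_w = max((i / 10 for i in range(11)), key=simulated_vote_share)
```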
When we examined the votes of the new human players, we found that the policy designed by deep RL was more popular than the baselines. In fact, when we ran a further experiment in which a fifth human player took on the role of referee and was trained to try to maximize votes, the policy implemented by this “human referee” was still less popular than our agent's.
AI systems have sometimes been criticized for learning policies that may be incompatible with human values, and this problem of “value alignment” has become a major concern in AI research. One merit of our approach is that the AI learns directly to maximize the stated preferences (or votes) of a group of people. This approach may help ensure that AI systems are less likely to adopt unsafe or unfair policies. Indeed, when we analyzed the policy the AI had discovered, it incorporated a mixture of ideas previously proposed by human thinkers and experts to solve the redistribution problem.
First, the AI chose to redistribute funds to people in proportion to their relative, rather than absolute, contribution. This means that when redistributing funds, the agent took into account each player's initial means as well as their willingness to contribute. Second, the AI system especially rewarded players whose relative contribution was more generous, perhaps encouraging others to do the same. Importantly, the AI discovered these policies only by learning to maximize human votes. The method therefore keeps humans “in the loop” and ensures that the AI produces human-compatible solutions.
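A hand-written caricature of a policy with that shape might look as follows. The bonus term, its weight, and all numbers are our own invention for illustration, not the agent's learned function.

```python
def redistribute(endowments, contributions, pool, bonus_weight=0.25):
    """Pay out in proportion to relative (fraction-of-endowment)
    contribution, plus a bonus for the most generous relative
    contributors. A caricature of the learned policy, not the policy."""
    fractions = [c / e for c, e in zip(contributions, endowments)]
    base = [pool * (1 - bonus_weight) * f / sum(fractions) for f in fractions]
    top = max(fractions)
    n_top = sum(f == top for f in fractions)
    bonus = [pool * bonus_weight / n_top if f == top else 0.0 for f in fractions]
    return [b + x for b, x in zip(base, bonus)]

# Two players each contribute 2, from endowments of 10 and 2: the player
# who gave everything receives the larger share.
print(redistribute([10.0, 2.0], [2.0, 2.0], pool=6.0))  # [0.75, 5.25]
```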
By asking people to vote, we exploited the principle of majoritarian democracy to decide what people want. Despite its broad appeal, it is widely recognized that majoritarian democracy comes with the caveat that the preferences of the majority are heeded over those of the minority. In our study, we made sure that – as in most societies – the minority consisted of the more generously endowed players. But more work is needed to understand how to balance the relative preferences of majority and minority groups, designing democratic systems that allow all voices to be heard.