A new formal definition of agency provides clear principles for causal modeling of AI agents and the incentives they face
We want to build safe and aligned artificial general intelligence (AGI) systems that pursue the goals intended by their designers. Causal influence diagrams (CIDs) are a way of modeling decision-making situations that allows us to reason about the incentives agents face. For example, here is a CID for a 1-step Markov decision process (MDP) – a typical framework for decision-making problems.
S1 represents the initial state, A1 the agent's decision (square), and S2 the next state. R2 is the agent's reward/utility (diamond). Solid edges specify causal influence; dotted edges specify information links – what the agent knows when making its decision.
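As an illustration, here is a minimal sketch (our own, not code from the paper) of how such a CID could be encoded as a labelled directed graph in Python using networkx; the `kind` labels on nodes and edges are an assumed convention for this example.

```python
# Minimal sketch: the 1-step MDP CID as a labelled directed graph.
import networkx as nx

cid = nx.DiGraph()

# Node types: chance (round), decision (square), utility (diamond).
cid.add_node("S1", kind="chance")    # initial state
cid.add_node("A1", kind="decision")  # agent's decision
cid.add_node("S2", kind="chance")    # next state
cid.add_node("R2", kind="utility")   # reward/utility

# Solid edges: direct causal influence.
cid.add_edge("S1", "S2", kind="causal")
cid.add_edge("A1", "S2", kind="causal")
cid.add_edge("S2", "R2", kind="causal")

# Dotted edge: information link (what the agent observes before acting).
cid.add_edge("S1", "A1", kind="information")

print(cid.edges(data=True))
```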
By linking training configurations to the incentives that shape agent behavior, CIDs help highlight potential risks before training an agent and can inspire better agent designs. But how do you know if a CID is an accurate model of a training configuration?
Our new paper, Discovering Agents, introduces new ways of tackling these problems, including:
- The first formal causal definition of agents: Agents are systems that would adapt their policies if their actions influenced the world in a different way
- An algorithm for discovering agents from empirical data
- A translation between causal models and CIDs
- The resolution of earlier confusions stemming from incorrect causal modeling of agents
Combined, these results provide an additional layer of assurance that no modeling mistake has been made, meaning that CIDs can be used to analyze an agent's incentives and safety properties with greater confidence.
Example: modeling a mouse as an agent
To illustrate our method, consider the following example: a world containing three squares, with a mouse starting in the middle square and choosing whether to go left or right, arriving at its next position, and then possibly getting some cheese. The floor is icy, so the mouse might slip. Sometimes the cheese is on the right, sometimes on the left.
The mouse and cheese environment.
This can be represented by the following CID:
CID for the mouse. D represents the mouse's decision to go left or right, X its new position (it might slip and end up on the other side by accident), and U whether it gets the cheese or not.
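The sketch below is a hypothetical simulation of this environment; the variable names D, X, U follow the CID, while `sample_episode` and the probabilities are our own illustrative choices.

```python
# Hypothetical simulation of the mouse-and-cheese environment.
import random

def sample_episode(policy, p_slip=0.1, p_cheese_right=0.7):
    cheese = "right" if random.random() < p_cheese_right else "left"
    D = policy()                          # decision: "left" or "right"
    slipped = random.random() < p_slip    # icy floor: the mouse may slip
    X = D if not slipped else ("left" if D == "right" else "right")  # new position
    U = 1 if X == cheese else 0           # utility: did it get the cheese?
    return D, X, U

# e.g. a mouse that always goes right
episodes = [sample_episode(lambda: "right") for _ in range(1000)]
print("average utility:", sum(u for _, _, u in episodes) / len(episodes))
```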
The intuition that the mouse would choose a different behavior for different environmental settings (iciness, cheese distribution) can be captured by a mechanized causal graph, which, for each (object-level) variable, also includes a mechanism variable that governs how the variable depends on its parents. Crucially, we allow links between the mechanism variables.
This graph contains additional mechanism nodes in black, representing the mouse's policy and the ice and cheese distributions.
Mechanized causal graph for mouse and cheese environment.
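As a rough sketch (again our own encoding, not the paper's), the mechanized causal graph for the mouse can be written down as a directed graph whose mechanism nodes we mark with a trailing tilde.

```python
# Sketch of the mechanized causal graph for the mouse example.
import networkx as nx

mech_graph = nx.DiGraph()

# Object-level edges: decision -> position -> utility.
mech_graph.add_edge("D", "X")
mech_graph.add_edge("X", "U")

# Mechanism nodes: D~ is the mouse's policy, X~ the ice (slip) distribution,
# U~ the cheese distribution. Each mechanism determines its own variable.
for v in ["D", "X", "U"]:
    mech_graph.add_edge(f"{v}~", v)

# Mechanism-level edges: the policy adapts both to how icy the floor is
# and to where the cheese tends to be.
mech_graph.add_edge("X~", "D~")
mech_graph.add_edge("U~", "D~")

print(sorted(mech_graph.edges()))
```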
Edges between mechanisms represent direct causal influence. The blue edges are special terminal edges – roughly, mechanism edges A~ → B~ that would remain even if the object-level variable A were altered so that it had no outgoing edges.
In the example above, since U has no children, its mechanism's edge must be terminal. But the mechanism edge X~ → D~ is not terminal, because if we cut X off from its child U, the mouse no longer adapts its decision (because its position would not affect whether it gets the cheese).
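To make this concrete, here is a small worked calculation of our own (with made-up probabilities and a hypothetical `best_decision` helper): the mouse's expected-utility-maximising decision responds to the slip probability only while X still influences U; once that link is cut, varying X~ no longer changes the policy, so X~ → D~ is not terminal.

```python
# Illustrative check of the terminal-edge condition for X~ -> D~.
def best_decision(p_slip, p_cheese_right, cut_x_to_u=False):
    def expected_utility(d):
        if cut_x_to_u:
            return 0.0  # U no longer depends on the mouse's position X
        p_right_pos = (1 - p_slip) if d == "right" else p_slip  # where X ends up
        return p_right_pos * p_cheese_right + (1 - p_right_pos) * (1 - p_cheese_right)
    return max(["left", "right"], key=expected_utility)

# With X -> U intact, the optimal policy responds to the slip probability X~ ...
print(best_decision(p_slip=0.1, p_cheese_right=0.7))  # 'right'
print(best_decision(p_slip=0.9, p_cheese_right=0.7))  # 'left'
# ... but once X is cut off from U, varying X~ no longer changes the policy.
print(best_decision(p_slip=0.1, p_cheese_right=0.7, cut_x_to_u=True))
print(best_decision(p_slip=0.9, p_cheese_right=0.7, cut_x_to_u=True))
```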
Causal discovery of agents
Causal discovery infers a causal graph from experiments involving interventions. In particular, one can discover an arrow from a variable A to a variable B by experimentally intervening on A and checking whether B responds, even when all other variables are held fixed.
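The toy snippet below (our own illustration, not the paper's algorithm) shows this intervention test on a two-variable system: fix all other sources of variation, set A to different values, and check whether B changes.

```python
# Toy illustration of discovering an edge A -> B via intervention.
import random

def run_system(a=None, noise_seed=0):
    rng = random.Random(noise_seed)        # hold all other variation fixed
    A = rng.random() if a is None else a   # do(A = a) when a is given
    B = 2 * A + 0.1 * rng.random()         # B causally depends on A
    return B

b_under_a0 = run_system(a=0.0)
b_under_a1 = run_system(a=1.0)
print("edge A -> B discovered:", abs(b_under_a1 - b_under_a0) > 1e-6)  # True
```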
Our first algorithm uses this technique to discover the mechanized causal graph:
Algorithm 1 takes as input interventional data from the system (the mouse and cheese environment) and uses causal discovery to produce a mechanized causal graph. See the paper for details.
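As a much-simplified sketch of the idea behind Algorithm 1 (specialised to the mouse, and only testing which mechanisms the policy mechanism D~ responds to; the full algorithm covers all variable pairs), one can intervene on each non-policy mechanism, retrain, and record the responses. `train_policy` is a hypothetical stand-in for whatever training or best-response process produced the system under study.

```python
# Simplified sketch of discovering mechanism edges into the policy D~.
def train_policy(p_slip, p_cheese_right):
    # "Training" here is just picking the expected-utility-maximising decision.
    def expected_utility(d):
        p_right_pos = (1 - p_slip) if d == "right" else p_slip
        return p_right_pos * p_cheese_right + (1 - p_right_pos) * (1 - p_cheese_right)
    return max(["left", "right"], key=expected_utility)

baseline = train_policy(p_slip=0.1, p_cheese_right=0.7)

mechanism_edges = []
for name, policy_after in [
    ("X~ (ice distribution)", train_policy(p_slip=0.9, p_cheese_right=0.7)),
    ("U~ (cheese distribution)", train_policy(p_slip=0.1, p_cheese_right=0.2)),
]:
    if policy_after != baseline:          # D~ responded to the intervention
        mechanism_edges.append((name, "D~ (policy)"))

print(mechanism_edges)  # both edges discovered
```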
Our second algorithm transforms this mechanized causal graph into a game graph:
Algorithm 2 takes a mechanized causal graph as input and maps it to a game graph. An incoming terminal edge indicates a decision; an outgoing terminal edge indicates a utility.
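The following sketch illustrates that labelling rule on the mouse's mechanism edges; the data structure and `node_type` helper are our own simplification of the mapping, not the paper's implementation.

```python
# Simplified sketch of Algorithm 2's labelling rule.
# Mechanism edges of the mouse graph: (source, target, terminal?)
mechanism_edges = [
    ("X~", "D~", False),
    ("U~", "D~", True),
]
variables = ["D", "X", "U"]

def node_type(v):
    mech = v + "~"
    has_incoming_terminal = any(dst == mech and term for _, dst, term in mechanism_edges)
    has_outgoing_terminal = any(src == mech and term for src, _, term in mechanism_edges)
    if has_incoming_terminal:
        return "decision"
    if has_outgoing_terminal:
        return "utility"
    return "chance"

print({v: node_type(v) for v in variables})  # {'D': 'decision', 'X': 'chance', 'U': 'utility'}
```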
Taken together, Algorithm 1 followed by Algorithm 2 allows us to discover agents from causal experiments, representing them using CIDs.
Our third algorithm transforms the game graph into a mechanized causal graph, allowing us to translate between the game-graph and mechanized-causal-graph representations under some additional assumptions:
Algorithm 3 takes a game graph as input and maps it to a mechanized causal graph. A decision indicates an incoming terminal edge, a utility indicates an outgoing terminal edge.
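Below is a rough sketch of that reverse direction, under strong simplifying assumptions of our own (it ignores the paper's relevance conditions): add one mechanism node per variable, and add terminal mechanism edges from each utility's mechanism into each decision's mechanism.

```python
# Simplified sketch of mapping a game graph back to a mechanized causal graph.
import networkx as nx

# Game graph for the mouse: node types plus object-level edges.
node_types = {"D": "decision", "X": "chance", "U": "utility"}
object_edges = [("D", "X"), ("X", "U")]

mech_graph = nx.DiGraph()
mech_graph.add_edges_from(object_edges)
for v in node_types:
    mech_graph.add_edge(f"{v}~", v)  # each mechanism determines its variable

for d, d_type in node_types.items():
    if d_type != "decision":
        continue
    for v, v_type in node_types.items():
        if v == d:
            continue
        # Utility mechanisms contribute terminal edges into the decision's
        # mechanism; other variables' mechanisms may also influence the policy.
        mech_graph.add_edge(f"{v}~", f"{d}~", terminal=(v_type == "utility"))

print(sorted(mech_graph.edges(data=True)))
```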
Better safety tools to model AI agents
We proposed the first formal causal definition of agents. Grounded in causal discovery, our key insight is that agents are systems that adapt their behavior in response to changes in how their actions influence the world. Indeed, our Algorithms 1 and 2 describe a precise experimental process that can help assess whether a system contains an agent.
Interest in causal modeling of AI systems is growing rapidly, and our research grounds this modeling in causal discovery experiments. Our paper demonstrates the potential of our approach by improving the safety analysis of several example AI systems, and shows that causality is a useful framework for discovering whether there is an agent in a system – a key concern for assessing risks from AGI.
Want to learn more? Check out our paper. Feedback and comments are very welcome.