Our approach to analyzing and mitigating future risks posed by advanced AI models
Google DeepMind has continued to push the boundaries of AI, developing models that have transformed our understanding of what is possible. We believe that AI technology on the horizon will provide society with invaluable tools to help address critical global challenges, such as climate change, drug discovery and economic productivity. At the same time, we recognize that as we continue to push the boundaries of AI capabilities, these advances could eventually lead to new risks beyond those posed by current models.
Today we are presenting our Frontier Safety Framework – a set of protocols for proactively identifying future AI capabilities that could cause severe harm, and putting mechanisms in place to detect and mitigate them. Our Framework focuses on severe risks resulting from powerful capabilities at the model level, such as exceptional agency or sophisticated cyber capabilities. It is designed to complement our alignment research, which trains models to act in accordance with human values and societal goals, as well as Google's existing suite of AI responsibility and safety practices.
The Framework is exploratory, and we expect it to evolve significantly as we learn from its implementation, deepen our understanding of AI risks and evaluations, and collaborate with industry, academia and government. Even though these risks are beyond the reach of present-day models, we hope that implementing and improving the Framework will help us prepare to address them. We aim to have this initial framework fully implemented by early 2025.
The framework
The first version of the Framework announced today builds on our research into evaluating critical capabilities in frontier models, and follows the emerging approach of responsible capability scaling. The Framework has three key elements:
- Identify capabilities a model may have with the potential for severe harm. To do this, we research the pathways through which a model could cause severe harm in high-risk areas, and then determine the minimal level of capabilities a model must have to play a role in causing such harm. We call these “Critical Capability Levels” (CCLs), and they guide our evaluation and mitigation approach.
- Evaluate our frontier models periodically to detect when they reach these Critical Capability Levels. To do this, we will develop suites of model evaluations, called “early warning evaluations,” that will alert us when a model is approaching a CCL, and run them frequently enough that we are notified before that threshold is reached (a simplified sketch of such a check follows this list).
- Apply a mitigation plan when a model passes our early warning evaluations. This plan should take into account the overall balance of benefits and risks, as well as the intended deployment contexts. These mitigations will focus primarily on security (preventing the exfiltration of models) and deployment (preventing misuse of critical capabilities).
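To make these elements concrete, here is a minimal illustrative sketch, in Python, of how an early-warning check against CCL thresholds could look. The CCL names, scores and thresholds are invented for illustration and do not reflect our actual evaluations or tooling.

```python
# Minimal sketch of an early-warning check. All CCL names, thresholds and
# scores below are hypothetical and for illustration only.
from dataclasses import dataclass

@dataclass
class CriticalCapabilityLevel:
    name: str                 # hypothetical CCL identifier
    alert_threshold: float    # evaluation score at which the early warning fires
    critical_threshold: float # score at which the CCL itself would be reached

def run_early_warning_evals(eval_scores: dict[str, float],
                            ccls: list[CriticalCapabilityLevel]) -> list[str]:
    """Return the names of CCLs whose early-warning threshold a model has crossed."""
    alerts = []
    for ccl in ccls:
        score = eval_scores.get(ccl.name, 0.0)
        if score >= ccl.alert_threshold:
            alerts.append(ccl.name)
    return alerts

if __name__ == "__main__":
    ccls = [
        CriticalCapabilityLevel("autonomy_l1", alert_threshold=0.6, critical_threshold=0.8),
        CriticalCapabilityLevel("cyber_l1", alert_threshold=0.5, critical_threshold=0.7),
    ]
    # Scores from a (hypothetical) evaluation run on a frontier model checkpoint.
    scores = {"autonomy_l1": 0.65, "cyber_l1": 0.3}
    print(run_early_warning_evals(scores, ccls))  # -> ['autonomy_l1']
```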
Risk areas and mitigation levels
Our initial set of critical capability levels is based on the study of four areas: autonomy, biosecurity, cybersecurity, and machine learning research and development (R&D). Our early research suggests that the capabilities of future foundation models are very likely to present serious risks in these areas.
For autonomy, cybersecurity and biosecurity, our main goal is to assess the extent to which threat actors could use a model with advanced capabilities to carry out harmful activities with severe consequences. For machine learning R&D, the focus is on whether models with such capabilities would enable the spread of models with other critical capabilities, or enable a rapid and unmanageable escalation of AI capabilities. As we conduct further research into these and other risk areas, we expect these CCLs to evolve, and for more CCLs at higher levels or in other risk areas to be added.
To allow us to tailor the strength of the mitigations to each CCL, we have also outlined a set of security and deployment mitigations. Higher-level security mitigations provide stronger protection against exfiltration of model weights, and higher-level deployment mitigations enable tighter management of critical capabilities. However, these measures can also slow the pace of innovation and reduce how broadly accessible capabilities are. Striking the optimal balance between mitigating risks and fostering access and innovation is paramount for the responsible development of AI. By weighing overall benefits against risks, and taking into account the context of model development and deployment, we aim to ensure responsible progress in AI that unlocks transformative potential while guarding against unintended consequences.
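As a purely illustrative sketch, this tailoring can be thought of as a lookup from triggered CCLs to the strictest required security and deployment levels; the names and levels below are hypothetical, not part of the Framework itself.

```python
# Hypothetical mapping from CCLs to mitigation levels (higher = stricter).
# Invented for illustration; not an actual mitigation plan.
MITIGATION_PLAN = {
    "autonomy_l1": {"security": 2, "deployment": 1},
    "cyber_l1":    {"security": 3, "deployment": 2},
}

def required_mitigations(triggered_ccls: list[str]) -> dict[str, int]:
    """Take the strictest security and deployment levels across all triggered CCLs."""
    security = max((MITIGATION_PLAN[c]["security"] for c in triggered_ccls), default=0)
    deployment = max((MITIGATION_PLAN[c]["deployment"] for c in triggered_ccls), default=0)
    return {"security": security, "deployment": deployment}

print(required_mitigations(["autonomy_l1", "cyber_l1"]))  # {'security': 3, 'deployment': 2}
```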
Investing in science
The research underpinning the Framework is nascent and progressing quickly. We have invested significantly in our Frontier Safety Team, which coordinated the cross-functional effort behind our Framework. Their mission is to advance the science of frontier risk assessment, and to refine our Framework based on our improved knowledge.
The team developed an evaluation suite to assess the risks from critical capabilities, with a particular focus on autonomous LLM agents, and road-tested it on our state-of-the-art models. Their recent paper describing these evaluations also explores mechanisms that could form a future “early warning system”. It describes technical approaches for assessing how close a model is to succeeding at a task it currently fails to accomplish, and also includes predictions about future capabilities from a team of expert forecasters.
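One way to make "how close a model is to succeeding" concrete is to break a task into intermediate milestones and award partial credit for the ones a model completes. The sketch below illustrates that idea with invented milestones and weights; it is not the methodology from the paper.

```python
# Hedged illustration: scoring partial progress on a multi-step agent task by
# awarding credit for completed milestones. Milestones and weights are invented.
def partial_progress(completed: set[str], milestones: dict[str, float]) -> float:
    """Fraction of total milestone weight achieved on a task the model fails end-to-end."""
    total = sum(milestones.values())
    achieved = sum(w for m, w in milestones.items() if m in completed)
    return achieved / total if total else 0.0

# Example: a hypothetical agentic task with three weighted milestones.
task_milestones = {"locate_target": 0.2, "gain_access": 0.5, "complete_objective": 0.3}
print(partial_progress({"locate_target", "gain_access"}, task_milestones))  # approximately 0.7
```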
Staying true to our AI principles
We will review and evolve the framework periodically. In particular, as we test the framework and deepen our understanding of risk areas, CCLs, and deployment contexts, we will continue our work calibrating CCL-specific mitigations.
At the heart of our work are Google's AI Principles, which commit us to pursuing widespread benefit while mitigating risks. As our systems improve and their capabilities increase, measures such as the Frontier Safety Framework will ensure that our practices continue to meet these commitments.
We look forward to working with others across industry, academia and government to develop and refine the Framework. We hope that sharing our approaches will make it easier to work with others to agree on standards and best practices for evaluating the safety of future generations of AI models.