Responsibility and security
New research offers framework for evaluating general-purpose models against new threats
To responsibly pioneer at the forefront of artificial intelligence (AI) research, we must identify new capabilities and risks in our AI systems as early as possible.
AI researchers already use a range of evaluation benchmarks to identify unwanted behaviours in AI systems, such as making misleading statements, making biased decisions, or repeating copyrighted content. Now, as the AI community builds and deploys increasingly powerful AI, we must expand the evaluation portfolio to cover the possibility of extreme risks from general-purpose AI models with strong skills in manipulation, deception, cyber-offense, or other dangerous capabilities.
In our latest paper, we introduce a framework for assessing these new threats, co-authored with colleagues from the University of Cambridge, University of Oxford, University of Toronto, University of Montreal, OpenAI, Anthropic, Alignment Research Center, Center for Long-Term Resilience and Center for AI Governance.
Model safety evaluations, including those assessing extreme risks, will be a critical part of the safe development and deployment of AI.
Extreme risk assessment
General-purpose models typically learn their capabilities and behaviours during training. However, existing methods for steering the learning process are imperfect. For example, previous research at Google DeepMind has explored how AI systems can learn to pursue undesired goals even when we correctly reward them for good behaviour.
Responsible AI developers must look ahead and anticipate possible future developments and new risks. With continued progress, future general-purpose models could learn a variety of dangerous capabilities by default. For example, it is plausible (although uncertain) that future AI systems will be capable of conducting offensive cyber operations, skilfully deceiving humans in dialogue, manipulating humans into carrying out harmful actions, designing or acquiring weapons (e.g. biological, chemical), fine-tuning and operating other high-risk AI systems on cloud computing platforms, or assisting humans with any of these tasks.
People with malicious intentions who gain access to such models could misuse their capabilities. Or, due to failures of alignment, these AI models could take harmful actions even without anyone intending them to.
Model evaluation helps us identify these risks in advance. In our framework, AI developers would use model evaluation to discover:
- The extent to which a model has certain “dangerous capabilities” that could be used to threaten security, exert influence, or evade oversight.
- The extent to which the model is prone to applying its capabilities to cause harm (i.e. the model’s alignment). Alignment evaluations should confirm that the model behaves as intended even across a very wide range of scenarios and, where possible, should examine the model’s internal workings.
The results of these evaluations will help AI developers understand whether the ingredients sufficient for extreme risk are present. The highest-risk cases will involve multiple dangerous capabilities combined. The AI system does not need to provide all the ingredients itself, as this diagram shows:
As a general rule, the AI community should treat an AI system as highly dangerous if it has a capability profile sufficient to cause extreme harm, assuming it is misused or poorly aligned. To deploy such a system in the real world, an AI developer would need to demonstrate an unusually high standard of safety.
Model evaluation as critical governance infrastructure
If we have better tools to identify risky models, companies and regulators will be able to better ensure:
- Responsible training: Responsible decisions are made about whether and how to train a new model that shows early signs of risk.
- Responsible deployment: Responsible decisions are made about if, when and how to deploy potentially risky models.
- Transparency: Useful and actionable information is communicated to stakeholders, to help them prepare for or mitigate potential risks.
- Appropriate security: Rigorous information security controls and systems are applied to models that may present extreme risks.
We have developed a blueprint for how model evaluations for extreme risks should feed into important decisions about training and deploying a highly capable, general-purpose model. The developer conducts evaluations throughout, and grants structured model access to external safety researchers and model auditors so that they can conduct additional evaluations. The evaluation results can then inform risk assessments before the model is trained and deployed.
Important early work on model evaluations for extreme risks is already underway at Google DeepMind and elsewhere. But much greater progress – both technical and institutional – is needed to build an evaluation process that detects all possible risks and helps guard against future and emerging challenges.
Model evaluation is not a panacea; some risks could slip through the cracks, for example because they depend too much on factors external to the model, such as complex social, political and economic forces in society. Model evaluation should be combined with other risk assessment tools and a broader commitment to safety across industry, government and civil society.
Google’s recent blog on responsible AI states that “individual practices, shared industry standards, and sound government policies would be essential for AI to succeed.” We hope that many others working in AI, and in the sectors affected by this technology, will come together to create approaches and standards for developing and deploying AI safely, for the benefit of all.
We believe that having processes to track the emergence of risky properties in models and adequately respond to findings of concern is an essential part of being a responsible developer operating at the frontier of AI capabilities.