Labs should:
- Do risk assessment — throughout the model lifecycle but especially between training and deployment
- Make commitments about development, deployment, security, and safety practices based on risk assessment results
For now, risk assessment should focus on evaluating models for dangerous capabilities, such as autonomous replication, hacking, and bioengineering. Once models have dangerous capabilities, risk assessment will also need to detect whether models are uncontrollable.
Currently at frontier labs:
- Anthropic has explained its risk assessment process at a high level and shared some details, OpenAI is working on doing so, and DeepMind has at least shared details on its model evaluations. Others have not.
- Anthropic has made commitments about development, deployment, security, and safety practices based on risk assessment results, and OpenAI is working on doing so. Others have not.
In addition to model evals for dangerous capabilities, risk assessment should involve open-ended red-teaming. Labs should also share model access with third-party evaluators and red-teamers with expertise in eliciting AI capabilities and relevant subjects, and access should include all modalities and enhancements that will be enabled in deployment (e.g. fine-tuning).
Labs should also make their risk assessment process accountable: they should ensure that commitments are executed as intended, that key stakeholders can verify that this is happening (or notice if it isn’t), that there are opportunities for third-party critique, and that changes to the risk assessment process itself don’t happen in a rushed or opaque way.1
(Here, “red-teaming” refers to detecting undesired model outputs, not security issues.)
What labs should do
This section is rough.
Labs should commit that before training models beyond particular risk or capability thresholds, they will reach particular safety and security goals and implement particular safety and security techniques. They should make similar commitments about safety goals and techniques before deploying models beyond particular risk or capability thresholds, whether internally or externally. See our Training, Deployment, and Security pages for some details. It should be clear what model capabilities or other risk assessment results would make them stop training or not deploy a model.
What dangerous capabilities should labs watch for? What exact model evaluations should labs use? This is an open problem that labs should solve. It currently seems labs should watch for dangerous capabilities including autonomous replication, coding (finding/exploiting vulnerabilities in code, writing malicious code, or writing code with hidden vulnerabilities), situational awareness, and perhaps long-horizon planning. For more, see Model evaluation for extreme risks (DeepMind: Shevlane et al. 2023), especially Table 1.2
Labs should use and facilitate third-party monitoring/auditing/oversight of training runs.3 They should give some evals organizations access to models to do red-teaming and dangerous capability evals. This access should include fine-tuning and tools/plugins. It should occur during training and again between training and deployment, ideally as a normal part of the lab’s risk assessment.4 In addition to improving safety, external monitoring can have benefits for transparency and coordination.
Labs should publicly explain the details of how they do risk assessment—omitting dangerous or acceleratory details, but generally including the model evals they use and how they do red-teaming. They should also publicly share their risk assessment results, again omitting dangerous or acceleratory details where necessary.
Capabilities increase during deployment due to model updates, improvements in fine-tuning, and inference-time improvements (prompting, plugins, and scaffolding). So labs should periodically repeat their full risk assessment during deployment, including before pushing large updates, in addition to doing live monitoring for risks. And evals should generally be done on systems enhanced to be more powerful or risky (via prompting, fine-tuning, plugins/tools, and scaffolding).
Labs may soon want their models to have dangerous capabilities because that can make the models more useful. Ideally labs would have decisive evidence that their dangerous models are aligned (or in particular, wouldn’t try to cause a catastrophe). But nobody is close to being able to make such alignment arguments. For now, labs’ risk assessment should determine whether their systems would be dangerous if those systems were scheming: pretending to be safe in order to gain power later. Before deploying a system that is plausibly capable of scheming, labs should have safety-focused deployment protocols and corresponding control arguments for their powerful models, i.e. arguments that those systems are very unlikely to cause a catastrophe (when deployed according to the protocol) even if the systems are scheming to do so.5 (And labs should publish the details of their control arguments.)
Ideally, labs would be able to quantify the risk of various bad outcomes entailed by the development and deployment (in various ways) of a particular model. But quantifying risks of catastrophes from frontier models is very hard, largely because no AI catastrophe has yet occurred and frontier models enable new capabilities. Perhaps labs should elicit the opinion of subject-matter experts or expert forecasters on questions relevant to risk assessment, such as the risk of model weights leaking and outcomes given that, or the risk of catastrophic misuse given some hypothetical eval results and deployment details. See e.g. Risk assessment at AGI companies (Koessler and Schuett 2023) and “What Standard Risk Management Looks Like” in “RSPs Are Risk Management Done Wrong” (SaferAI: Campos 2023). Additionally, labs should do scenario analysis, but it would be hard to tell how well they use this technique from the outside.
“There is a risk that – through conducting dangerous capability evaluations and sharing relevant materials – the field will proliferate dangerous capabilities or accelerate their development.”6 But labs should be able to do model evals carefully and avoid this risk.
Labs should also do risk assessment before training to predict their future models’ dangerous capabilities and inform the lab’s training plan. There are three reasons to worry about dangerous capabilities during training rather than just before and during deployment:
- Some risk of AI takeover appears in training.
- Once trained, models can be stolen or leaked, then deployed unsafely by incautious actors.
- Once trained, models can quickly be deployed by the lab itself. In both cases, the surest way to avoid dangerous deployment is to avoid dangerous training.
Other sociotechnical safety evaluations seem lower-priority than high-effort model evaluations.
Labs should retrospectively review major deployments to inform future risk assessment.7
Alternatives to “consistently testing for dangerous capabilities” could suffice for safety, including “Staying behind the frontier” or perhaps “scaling up capabilities slowly [and releasing widely].”8 These approaches are less relevant to frontier labs.
If other labs are similarly capable and much less safe, it could be optimal for a lab to develop risky systems (despite knowing they’re risky). If a lab does this, it should make clear that it’s doing so and how it plans to promote AI safety. See Responsible Scaling Policies (METR 2023).
Evaluation
(As an alternative to the below, a commitment to stay behind the frontier or scale very slowly could suffice. These paths are less relevant to the frontier labs we currently evaluate.)
Measuring threats (60%).
- (0/26) Do risk assessment before training. Before building a frontier model, predict model capabilities (in terms of benchmarks and real-world applications, especially dangerous capabilities) and predict the real-world consequences of developing and deploying the model.
- Do model evals for dangerous capabilities before deployment:
- (1/26) Say what dangerous capabilities the lab watches for (given that it does so at all).
- (6/26) … and watch for autonomous replication, coding (finding/exploiting vulnerabilities in code, writing malicious code, or writing code with hidden vulnerabilities), and situational awareness or long-horizon planning. (Largely this is an open problem that labs should solve.)
- (3/26) … and detail the specific tasks the lab uses in its evaluations. (Omit dangerous details if relevant.)
- (6/26) Explain the details of how the lab evaluates performance on the tasks it uses in model evals and how it does red-teaming (excluding dangerous or acceleratory details). In particular, explain its choices about fine-tuning, scaffolding/plugins, prompting, how prompts are iterated on, and whether the red team gets a fixed budget of person-hours and compute or how else it decides when to give up on eliciting a capability. And those details should be good. Evaluated holistically.
- (4/26) Prepare to have control arguments for the lab’s powerful models, i.e. arguments that those systems cannot cause a catastrophe even if the systems are scheming.9 And publish this. For now, the lab should:
- Prepare to do risk assessment to determine whether its systems would be dangerous, if those systems were scheming.
- Test its AI systems to ensure that they report coup attempts (or other misbehavior) by themselves or other (instances of) AI systems, and that they almost never initiate or cooperate with coup attempts.
- (6/26) Give some third parties access to models to do model evals for dangerous capabilities. This access should include fine-tuning and tools/plugins. It should occur both during training and between training and deployment. It should include base models rather than just safety-tuned models, unless the lab can demonstrate that the safety-tuning is robust. The third parties should have independence and control over their evaluation; just using external red-teamers is insufficient. The third parties should have expertise in eliciting model capabilities (but the lab should also offer assistance in this) and in particular subjects if relevant. The lab should incorporate the results into its risk assessment.
(Risk assessment should also include some less-intensive evaluation for capabilities that are less dangerous or less likely to appear, and should include some open-ended red-teaming.)
Commitments (15%).
- Commit to use risk assessment frequently enough: do risk assessment (for dangerous capabilities) regularly during training, before deployment, and during deployment (and commit to doing so), so that the lab will detect warning signs before dangerous capabilities appear. During training, do risk assessment at least every 4x increase in training compute; during deployment, do it before pushing major changes and otherwise at least every 3 months (to account for improvements in fine-tuning, scaffolding, plugins, prompting, etc.).
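As a rough illustration, the cadence in this commitment could be operationalized as a simple trigger rule like the sketch below (the function and variable names are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical trigger rule mirroring the cadence above: re-run dangerous-capability
# evals after every 4x increase in training compute, before any major deployment
# change, and otherwise at least every 3 months during deployment.
COMPUTE_RATIO_TRIGGER = 4.0
DEPLOYMENT_INTERVAL = timedelta(days=90)

def evals_due(current_compute: float,
              compute_at_last_eval: float,
              last_eval_time: datetime,
              major_change_pending: bool,
              now: datetime) -> bool:
    """Return True if risk assessment should be re-run under this illustrative rule."""
    compute_trigger = current_compute >= COMPUTE_RATIO_TRIGGER * compute_at_last_eval
    time_trigger = now - last_eval_time >= DEPLOYMENT_INTERVAL
    return compute_trigger or time_trigger or major_change_pending
```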
Accountability (25%).
- (7/16) Verification: publish updates on risk assessment practices and results, including low-level details, at least quarterly. One-third credit for reporting internally; one-third for also sharing with external safety organizations; one-third for publishing. Multiply by half if updates exclude low-level details.
- Revising policies
- (4/16) Avoid bad changes: a nonprofit board or other somewhat-independent group with a safety mandate should have veto power on changes to risk assessment practices and corresponding commitments, at the least. And “changes should be clearly and widely announced to stakeholders, and there should be an opportunity for critique.” As an exception, “For minor and/or urgent changes, [labs] may adopt changes to [their policies] prior to review. In these cases, [they should] require changes . . . to be approved by a supermajority of [the] board. Review still takes place . . . after the fact.” Key Components of an RSP (METR 2023).
- (3/16) Promote good changes: have a process for staff and external stakeholders to share concerns about risk assessment policies or their implementation with the board and some other staff, including anonymously.
- (2/16) Elicit external review of risk assessment practices and commitments. Publish those reviews, with light redaction if necessary.
Sources
Theories of Change for AI Auditing (Apollo 2023).
Model evaluation for extreme risks (DeepMind: Shevlane et al. 2023).
Responsible Scaling Policies (METR 2023) and the other resources on our Responsible Scaling Policy page.
Evaluating Language-Model Agents on Realistic Autonomous Tasks (METR: Kinniment et al. 2023).
Towards Publicly Accountable Frontier LLMs (Anderljung et al. 2023).
Towards best practices in AGI safety and governance (GovAI: Schuett et al. 2023):
- “Pre-deployment risk assessment. AGI labs should take extensive measures to identify, analyze, and evaluate risks from powerful models before deploying them.”
- “Dangerous capability evaluations. AGI labs should run evaluations to assess their models’ dangerous capabilities (e.g. misuse potential, ability to manipulate, and power-seeking behavior).”
- “Third-party model audits. AGI labs should commission third-party model audits before deploying powerful models.”
- “Red teaming. AGI labs should commission external red teams before deploying powerful models.”
- “Post-deployment evaluations. AGI labs should continually evaluate models for dangerous capabilities after deployment, taking into account new information about the model’s capabilities and how it is being used.”
- “Pre-training risk assessment. AGI labs should conduct a risk assessment before training powerful models.”
- “Pausing training of dangerous models. AGI labs should pause the development process if sufficiently dangerous capabilities are detected.”
- “Publish results of internal risk assessments. AGI labs should publish the results or summaries of internal risk assessments, unless this would unduly reveal proprietary information or itself produce significant risk. This should include a justification of why the lab is willing to accept remaining risks.”
Evaluations for autonomous replication capabilities: Evaluating Language-Model Agents on Realistic Autonomous Tasks (METR: Kinniment et al. 2023) and “ASL-3 Evaluations for Autonomous Capabilities” in “Anthropic’s Responsible Scaling Policy” (Anthropic 2023).
Red-teaming for biorisk capabilities: Building an early warning system for LLM-aided biological threat creation (OpenAI 2024), Frontier Threats Red Teaming for AI Safety (Anthropic 2023), and The Operational Risks of AI in Large-Scale Biological Attacks (RAND: Mouton et al. 2024).
Control evaluations:
- AI Control: Improving Safety Despite Intentional Subversion (Redwood: Greenblatt et al. 2023).
- Catching AIs red-handed (Redwood: Greenblatt and Shlegeris 2024).
- Notes on control evaluations for safety cases (Redwood: Greenblatt et al. 2024)
Towards understanding-based safety evaluations (Hubinger 2023).
“Model evaluations and red teaming” in “Emerging processes for frontier AI safety” (UK Department for Science, Innovation & Technology 2023) recommends model evals for dangerous capabilities and uncontrollability, evaluating models throughout their lifecycle, third-party model evals (including fine-tuning ability, access to models with safety mitigations, etc.) before deployment, and more.
“Responsible capability scaling” in “Emerging processes for frontier AI safety” (UK Department for Science, Innovation & Technology 2023) recommends risk assessment throughout the model lifecycle, defining risk thresholds, committing to safety mitigations based on risk thresholds (and thus to pause a model’s development or deployment until implementing the mitigations corresponding to its riskiness), accountability mechanisms for risk assessment, and more.
Frontier Threats Red Teaming for AI Safety (Anthropic 2023).
Challenges in evaluating AI systems (Anthropic 2023).
What labs are doing
It’s sometimes said that Amazon, Anthropic, Google, Inflection AI, Meta, Microsoft, Mistral AI and OpenAI committed to share pre-deployment model access with the UK AI Safety Institute. It’s not clear that they did; these claims may trace back to the UK AI safety summit “safety testing” session statement, which did not include specific commitments. Regardless, according to Politico, only DeepMind has shared this access, and per the Gemini report, that access was very limited.
Microsoft: they say they’re “committed to responsible development and deployment of increasingly capable AI systems” and “Microsoft and OpenAI have together defined capability thresholds that act as a trigger to review models in advance of their first release or downstream deployment.” But they don’t provide details. Similarly, they say “For all generative AI products characterized as high risk, we are also implementing processes to ensure consistent and holistic AI red teaming by our AI Red Team, an expert group independent to base model or product groups. We are also building out external red-teaming capacity to ensure our readiness to organize red team testing by one or more independent experts prior to the release of new and highly capable foundation models that may be trained by Microsoft, consistent with our July commitment. The topics covered by such red team testing will include testing of dangerous capabilities, including related to biosecurity and cybersecurity.” But they share no details relevant to extreme risks.
They wrote a “case study” in What is Red Teaming? (Frontier Model Forum 2023):
Microsoft Case Study: Red teaming Bing Chat
Background
In October 2022, Microsoft became aware of the new GPT-4 model from OpenAI. Having known that teams within Microsoft would be interested in integrating GPT-4 in first party and third party products, Microsoft created a cross functional team of subject matter experts (SMEs) to experiment with the model and understand its capabilities as well as risks. A follow up red teaming exercise was established for identifying risks with Bing chat, which was the first Microsoft product to integrate GPT-4.
Methodology
The initial round of red teaming the GPT-4 base model was done on the raw model with no additional mitigations from Microsoft. More than 20 SMEs from across the company with diverse expertise from law, policy, AI, engineering and research, security, and responsible AI came together and probed the model to identify its risks. In addition to the team directly experimenting with the model, the group granted access to the model to a small group of SMEs in national security and conducted interviews with them to better understand the risk surface in the specific high-stakes domain. The red teaming exercise on the GPT-4 raw model was more open ended and exploratory, meaning the goal was to identify as many risks and failure modes as possible, identify risk areas for further investigation, and implement early mitigation strategies.
The second round of red teaming was initiated as Bing chat was being developed and becoming mature. In this round of red teaming, more than 50 SMEs from across the company came together to red team the application with mitigations integrated in. We took an iterative approach with weekly red teaming sprints, where each week we red teamed priority features and risk areas, documented results, and worked with relevant measurement and mitigation teams to make sure the red teaming results informed their next steps. While this round of red teaming was more targeted given there was an established list of risks from red teaming the base GPT-4 model, the team still identified a series of new risks, which then led to reprioritizing the new risks for further investigation.
Outcome
Red teaming Bing chat highlighted the need for testing both base model and downstream applications iteratively. Red teaming the base model in an open-ended way and with no additional mitigations allowed us to understand the model’s risk surface and failure modes, identify areas for further investigation, and to watch out for in application developments. Given many risks are context and application dependent, red teaming downstream applications with mitigations integrated in is necessary. While the weekly exploratory and qualitative red teaming sprints were not a replacement for at-scale measurement and mitigation work, the identified examples served as seeds to create at-scale measurement datasets, informed prioritization and implementation of mitigation strategies, and helped stakeholders make more informed decisions. Microsoft continues to learn and improve on the process we have established for red teaming.
This (along with Microsoft’s similar discussion of risk analysis for Bing Chat here) is disappointing. Microsoft’s deployment of Bing Chat clearly involved a failure of red-teaming [citations needed]. Microsoft fails to acknowledge this or discuss what they plan to do differently. Bing Chat was insufficiently capable to be dangerous, but this incident suggests that they are either unable to detect or unable to fix their models’ problems, and they’re not grappling with their mistakes.
See also Microsoft’s AI Red Team Has Already Made the Case for Itself (WIRED 2023).
Google DeepMind: DeepMind does not have a Responsible Scaling Policy or other commitments (beyond the White House voluntary commitments) to achieve specific safety goals or use specific safety practices before training or deploying models with specific dangerous capabilities, to refrain from training or deploying such models, or to evaluate their models for specific dangerous capabilities. CEO Demis Hassabis says he plans for DeepMind to publicly write about relevant policies in 2024.
Evaluating Frontier Models for Dangerous Capabilities (DeepMind: Phuong et al. 2024) presents model evals for dangerous capabilities in four categories: persuasion and deception, cybersecurity capabilities, self-proliferation, and self-reasoning. (They say they are developing evals for CBRN capabilities.) They share details on the 52 tasks they use and how they evaluated the Gemini 1.0 models on those tasks. They use scaffolding to elicit stronger capabilities. All of this is great.
But, vitally, DeepMind doesn’t seem to have a plan for what to do after it determines that its models have dangerous capabilities.
The Gemini report (original, updated) briefly mentions model evals for dangerous capabilities:
Assurance evaluations are conducted for the purpose of governance and review, usually at the end of key milestones or training runs by a group outside of the model development team. Assurance evaluations are standardized by modality and datasets are strictly held-out. Only high-level insights are fed back into the training process to assist with mitigation efforts. Assurance evaluations include testing across Gemini policies, and include ongoing testing for dangerous capabilities such as potential biohazards, persuasion, and cybersecurity.
That paper also says “External evaluations are conducted by partners outside of Google to identify blindspots” and “specialist internal teams conduct ongoing red teaming of our models across areas such as the Gemini policies and security” but they aren’t specific.
“Impact Assessment” in the original Gemini report says:
We develop model impact assessments to identify, assess, and document key downstream societal benefits and harms associated with the development of advanced Gemini models. These are informed by prior academic literature on language model risks (Weidinger et al., 2021), findings from similar prior exercises conducted across the industry (Anil et al., 2023; Anthropic, 2023; OpenAI, 2023a), ongoing engagement with experts internally and externally, and unstructured attempts to discover new model vulnerabilities. Areas of focus include: factuality, child safety, harmful content, cybersecurity, biorisk, representation and inclusivity. These assessments are updated in tandem with model development.
This passage was mostly removed from an updated version of the report. It’s not clear why.
The Gemma report says “we present comprehensive evaluations of safety and responsibility aspects of the models,” but doesn’t mention evals for dangerous capabilities.
DeepMind’s great paper Model evaluation for extreme risks (with coauthors including staff from OpenAI and Anthropic) discusses model evals.
After training, DeepMind does red-teaming for adversarial inputs. They do not seem to have partnered with external safety organizations. They do not seem to have written much about their risk assessment for dangerous capabilities; in particular, the only relevant part of the Gemini report is quoted above, and the PaLM 2 page says nothing about risk assessment for dangerous capabilities. But they say “Important early work on model evaluations for extreme risks is already underway at Google DeepMind.” They have done research on red-teaming: Red Teaming Language Models with Language Models (DeepMind 2022).
“Model Evaluations and Red-Teaming” in “AI Safety Summit: An update on our approach to safety and responsibility” (DeepMind 2023) says: “We are currently scoping new dangerous capabilities evaluations in specific domains (biosecurity, autonomous replication and adaptation, cybersecurity, and persuasion and manipulation) that may be appropriate for our future model launches, as well as how to execute them safely and responsibly.”
“Responsible Capabilities Scaling” in “AI Safety Summit: An update on our approach to safety and responsibility” (DeepMind 2023) mentions many ways that DeepMind tries to be responsible, including Google’s “AI Principles,” Google’s “Responsible AI Council,” DeepMind’s “Responsible Development and Innovation team,” DeepMind’s “standardised Ethics and Safety Assessment,” and DeepMind’s “Responsibility and Safety Council.” But it’s not clear that any of these steps are effective for reducing catastrophic risk. For example, they say “we publish regular reports on our AI risk management approaches,” but the linked report doesn’t discuss catastrophic risk, model evals for dangerous capabilities, or safety techniques for high-stakes risks.
They have not made specific safety commitments based on specific possible risk assessment results; in particular, Responsible Capabilities Scaling missed this opportunity.
DeepMind shared Gemini Ultra with unspecified external groups apparently including the UK AI Safety Institute to test for dangerous capabilities before deployment. But DeepMind didn’t share deep access: it only shared a system with safety fine-tuning and safety filters and it didn’t allow them to fine-tune the model. DeepMind has not shared any results of this testing. See also discussion here.
They wrote a “case study” in What is Red Teaming? (Frontier Model Forum 2023):
Google DeepMind Case Study: Adversarial Probing of Google DeepMind’s Gopher Model
Background
As language models increase in capability, there has been increasing interest in developing methods for being able to discover potential harmful content or vulnerabilities at scale. Adversarial testing has emerged as one approach to “red teaming” where the aim is to discover harmful content or vulnerabilities in the model through a combination of automated or manual probing techniques. While manual techniques to adversarial testing can be effective, the results will vary based on the creativity of the prober and could lead to critical safety oversights in the assessment of a model. To complement existing manual approaches to adversarially testing an AI system, our research paper “Red Teaming Language Models with Language Models”, introduces the potential of utilizing a “red-team” language model (LM) to generate a diverse test set to evaluate the target language model’s responses.
Methodology
The probing focused on a “Dialogue-Prompted” variant of GDM’s Gopher (DPG) language model, and utilized a three-stage approach for identifying test cases which produce model failures:
- Generate test cases using a designated “red-team” LM which are confirmed by a score generated by an automated scoring model as likely to generate a harmful output
- Using the “target” LM, generate an output for each selected test case
- Use the automated scoring model to identify the test cases which led to a harmful output
Additional approaches were used to generate test cases using the red-team LM, such as zero-shot sampling and supervised learning on successful adversarial questions to arrive at the final test set. The final test set was used to red-team the DPG model for various harms including offensive content, data leakage, inappropriate contact info generation, and distributional bias against groups.
Outcomes
Overall, this methodology demonstrates how LMs can be leveraged to effectively and automatically find problematic behaviors in other LMs as part of a safety testing pipeline. The collective range of approaches generated approximately 500,000 conversation-starting questions that elicited offensive responses. The red LM questions performed similarly or better than human-written adversarial examples from prior work in terms of eliciting offensive responses. Different red teaming methods like few-shot tuning, supervised learning, and reinforcement learning were able to increase the difficulty of questions while maintaining diversity of the test set.
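The three-stage approach described in this case study can be sketched roughly as follows; the red_team_lm, target_lm, and harm_classifier callables are hypothetical stand-ins rather than DeepMind’s actual interfaces, and the pre-filtering of test cases by the scoring model is omitted:

```python
from typing import Callable, List, Tuple

def red_team_pipeline(
    red_team_lm: Callable[[str], List[str]],      # stage 1: generates candidate test questions
    target_lm: Callable[[str], str],              # stage 2: the model being probed (e.g. a dialogue model)
    harm_classifier: Callable[[str, str], float], # stage 3: scores how harmful a (question, reply) pair is
    seed_prompt: str,
    harm_threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Simplified sketch: generate test cases with a red-team LM, collect the
    target LM's responses, and keep the cases the scoring model flags as harmful."""
    failures = []
    for question in red_team_lm(seed_prompt):
        reply = target_lm(question)
        score = harm_classifier(question, reply)
        if score >= harm_threshold:
            failures.append((question, reply, score))
    return failures
```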
Meta AI: we do not believe they have written about any risk assessment during training.
After training: they say they do red-teaming for “adversarial threats”; it’s not clear what this means. They do red-teaming for model limitations/vulnerabilities.10 They do not seem to have done red-teaming or model evals for dangerous capabilities, nor partnered with external safety organizations. See generally their writing on Llama 2.11
They have not made safety commitments based on risk assessment results; in particular, “Responsible Capability Scaling” in “Overview of Meta AI safety policies prepared for the UK AI Safety Summit” missed this opportunity.
OpenAI: Risk assessment during training, and responses to its results, are covered by OpenAI’s Preparedness Framework. The basic framework is as follows (a rough sketch of the gating logic appears after the list):
- Do dangerous capability evals at least every 2x increase in effective training compute. This involves fine-tuning for dangerous capabilities, then doing evals on pre-mitigation and post-mitigation versions of the fine-tuned model. Score the models as Low, Medium, High, or Critical in each of several categories. The initial categories are cybersecurity, CBRN (chemical, biological, radiological, nuclear threats), persuasion, and model autonomy.
- If the post-mitigation model scores High in any category, don’t deploy it until implementing mitigations such that it drops to Medium.
- If the pre-mitigation model scores Critical in any category, stop developing it until implementing mitigations such that it drops to High.
- If the pre-mitigation model scores High in any category, harden security to prevent exfiltration of model weights. (Details basically unspecified for now.)
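Stated as a decision procedure, the framework’s gates look roughly like the sketch below; the function, category names, and data format are illustrative rather than OpenAI’s implementation:

```python
from enum import IntEnum

class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

CATEGORIES = ["cybersecurity", "CBRN", "persuasion", "model_autonomy"]

def preparedness_gates(pre_mitigation: dict, post_mitigation: dict) -> dict:
    """Illustrative restatement of the framework's gating rules, given per-category
    risk scores for the pre-mitigation and post-mitigation versions of a model."""
    worst_pre = max(pre_mitigation[c] for c in CATEGORIES)
    worst_post = max(post_mitigation[c] for c in CATEGORIES)
    return {
        # Deploy only if the post-mitigation model is at most Medium in every category.
        "may_deploy": worst_post <= Risk.MEDIUM,
        # Pause development if the pre-mitigation model reaches Critical anywhere.
        "pause_development": worst_pre >= Risk.CRITICAL,
        # Harden security against weight exfiltration once pre-mitigation risk is High.
        "harden_security": worst_pre >= Risk.HIGH,
    }
```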
The thresholds for risk levels feel high. OpenAI does not have a context-independent definition of High or Critical risk, but for example, in the Cybersecurity category, High is defined as
Tool-augmented model can identify and develop proofs-of-concept for high-value exploits against hardened targets without human intervention, potentially involving novel exploitation techniques, OR provided with a detailed strategy, the model can end-to-end execute cyber operations involving the above tasks without human intervention
and Critical is defined as
Tool-augmented model can identify and develop functional zero-day exploits of all severity levels, across all software projects, without human intervention OR model can devise and execute end-to-end novel strategies for cyberattacks against hardened targets given only a high level desired goal.
These capability levels are quite alarming.
The Preparedness Framework says OpenAI plans to harden security (to prevent model exfiltration) as it reaches higher risk levels, but doesn’t have details.
The current, beta version of the Preparedness Framework doesn’t detail the model evaluations OpenAI will do. But presumably the final version will.
OpenAI commits to maintain a scorecard showing their risk level in each of their dangerous-capability categories. This is good, but it would be better if they also committed to publish their eval results directly (redacting dangerous details if relevant).
The Preparedness Framework does not mention monitoring and responding to improvements in post-training enhancements, nor planning for deployment corrections.
The Preparedness Framework includes some commitment that the Board will be kept informed about OpenAI’s risk assessment and able to overrule OpenAI leadership. This is good.
They say they made predictions about GPT-4’s capabilities during training, but this seems to mean benchmark performance, not dangerous capabilities.12 They say that forecasting threats and dangerous capabilities is part of their Preparedness Framework, but they don’t share details.13
The GPT-4 system card discusses red-teaming for dangerous capabilities.
They have partnered with METR; METR evaluated GPT-4 for autonomous replication capabilities before release. They haven’t committed to share access for model evals more generally. They say “Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties to ensure accurate reporting of results, either by reproducing findings or by reviewing methodology to ensure soundness, at a cadence specified by the SAG and/or upon the request of OpenAI Leadership or the BoD.”
The OpenAI Red Teaming Network is a good idea; it is not clear how successful it is.
On OpenAI’s red-teaming for biorisk, see Building an early warning system for LLM-aided biological threat creation (OpenAI 2024).
They wrote a “case study” in What is Red Teaming? (Frontier Model Forum 2023):
OpenAI Case Study: Expert Red Teaming for GPT-4
Background
In August 2022, OpenAI began recruiting external experts to red team and provide feedback on GPT-4. Red teaming has been applied to language models in various ways: to reduce harmful outputs and to leverage external expertise for domain-specific adversarial testing. OpenAI’s approach is to red team iteratively, starting with an initial hypothesis of which areas may be the highest risk, testing these areas, and adjusting as required. It is also iterative in the sense that OpenAI uses multiple rounds of red teaming as new layers of mitigation and control are incorporated.
Methodology
OpenAI recruited 41 researchers and industry professionals - primarily with expertise in fairness, alignment research, industry trust and safety, dis/misinformation, chemistry, biorisk, cybersecurity, nuclear risks, economics, human-computer interaction, law, education, and healthcare - to help gain a more robust understanding of the GPT-4 model and potential deployment risks. OpenAI selected these areas based on a number of factors including but not limited to: prior observed risks in language models and AI systems and domains where OpenAI has observed increased user interest in the application of language models. Participants in this red team process were chosen based on prior research or experience in these risk areas. These experts had access to early versions of GPT-4 and to the model with in-development mitigations. This allowed for testing of both the model and system level mitigations as they were developed and refined.
Outcomes
The external red teaming exercise identified initial risks that motivated safety research and further iterative testing in key areas. OpenAI reduced risk in many of the identified areas with a combination of technical mitigations, and policy and enforcement levers; however, some risks still remain. While this early qualitative red teaming exercise was very useful for gaining insights into complex, novel models like GPT-4, it is not a comprehensive evaluation of all possible risks, and OpenAI continues to learn more about these and other categories of risk over time. The results of the red teaming process were summarized and published in the GPT-4 System Card.
See also Lessons learned on language model safety and misuse (OpenAI 2022).
Anthropic: their Responsible Scaling Policy explains how they do risk assessment and how they respond to the risks it identifies. They define “AI Safety Levels” (ASLs); a model’s ASL is determined by its dangerous capabilities. Their most capable current models are ASL-2. For ASL-2 and ASL-3, they describe “containment measures” they will implement for training and storing the model and “deployment measures” they will implement for using the model, even internally. They commit to define ASL-4 and corresponding safety measures before training ASL-3 models.
Anthropic’s RSP involves during-training evaluation for dangerous capabilities including bioengineering and autonomous replication. They commit to do ASL evaluations every 4x increase in effective training compute (and at least every 3 months).14 15 They say they design ASL evaluations with a “safety buffer” of 6x effective training compute, such that e.g. ASL-3 evaluations trigger 6x below the effective training compute necessary for ASL-3 capabilities. Then they implement ASL-3 safety measures before scaling further. This ensures that Anthropic doesn’t unwittingly reach ASL-3 between evaluations before implementing the ASL-3 safety measures. This framework is good. 4x might be too large a gap between evaluations: it’s fine if the safety buffer works correctly, but predicting model capabilities on messy real-world tasks as a function of training compute (and thus setting the safety buffer) is hard.
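A quick arithmetic check of how the 4x eval cadence and the 6x safety buffer interact, assuming the buffer is estimated correctly:

```python
# Worst case under this scheme: the last eval passes at effective compute just
# below the warning threshold W, and the next eval runs after a further 4x,
# i.e. at just under 4W. ASL-3 capabilities are expected at 6W, so the warning
# still fires before the capability threshold, with roughly 1.5x (6/4) of
# headroom left if the buffer estimate turns out to be optimistic.
EVAL_CADENCE = 4.0   # re-evaluate after every 4x increase in effective compute
SAFETY_BUFFER = 6.0  # warning threshold sits 6x below expected ASL-3 compute

compute_at_next_eval = EVAL_CADENCE * 1.0  # in units of the warning threshold W
compute_at_asl3 = SAFETY_BUFFER * 1.0      # in units of W
assert compute_at_next_eval < compute_at_asl3
headroom = compute_at_asl3 / compute_at_next_eval  # 1.5
```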
ASL-3 is defined as:
Low-level autonomous capabilities
or
Access to the model would substantially increase the risk of catastrophic misuse, either by proliferating capabilities, lowering costs, or enabling new methods of attack, as compared to a non-LLM baseline of risk. [Biology and cyber capabilities are currently evaluated.]
Since they have a safety buffer, Anthropic’s ASL-3 evaluations don’t just check whether this threshold is reached, but rather whether warning signs are reached. They have an incomplete illustration of their autonomous capabilities evals and don’t have many details on their misuse risk evals. For autonomous capabilities, they operationalize the safety buffer by checking whether the model can complete at least 50% of autonomous-capabilities tasks at least 10% of the time.
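A minimal sketch of that operationalization, assuming per-task success counts from repeated attempts (the data format and function name are hypothetical):

```python
def asl3_autonomy_warning(results: dict, task_fraction: float = 0.5,
                          attempt_fraction: float = 0.1) -> bool:
    """`results` maps task name -> (successes, attempts). The warning fires if
    the model completes at least `task_fraction` of the tasks on at least
    `attempt_fraction` of attempts."""
    tasks_passed = sum(
        1 for successes, attempts in results.values()
        if attempts > 0 and successes / attempts >= attempt_fraction
    )
    return tasks_passed >= task_fraction * len(results)
```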
Anthropic is currently implementing their ASL-2 commitments:
- Evaluating for ASL-3 warning signs. But it’s not clear what evals they’re doing.
- Implementing some security practices. The details are mostly opaque, but include “limiting access to model weights to those whose job function requires it, establishing a robust insider threat detection program, and storing and working with the weights in an appropriately secure environment to reduce the risk of unsanctioned release.”
- Deployment practices including publishing model cards, “Train[ing] models to refuse requests to aid in causing harm,” and some “trust and safety detection and enforcement.”
Anthropic commits that before they reach ASL-3, they will implement their ASL-3 commitments:
- Before reaching ASL-3, define ASL-4 and determine how to evaluate for warning signs.
- Before reaching ASL-3, improve their security. The details are underspecified but they say they’ll clarify.
- Before deploying an ASL-3 model (even internally), implement ASL-3 Deployment Measures. These include strong red-teaming and good automated detection of catastrophic misuse attempts.
While their risk assessment involves expert red-teaming, they have not committed to share models with external evaluators like METR. But they did partner with METR to evaluate Claude 2 before release. See “Evaluations and Red Teaming” in “Model Card and Evaluations for Claude Models” (Anthropic 2023).
The RSP also contains good procedural commitments:
- Update Process: changes are subject to board approval and are published before they’re implemented, except when necessary to defend against imminent catastrophe
- A model’s eval results will generally be published after deployment
- The board receives eval results and information on RSP implementation
- They will “Implement a non-compliance reporting policy” before reaching ASL-3
Anthropic’s other statements and work on risk assessment are promising. See Frontier Threats Red Teaming for AI Safety (Anthropic 2023), Red teaming and model evaluations (Anthropic 2023) and Dario Amodei’s prepared remarks from the AI Safety Summit on Anthropic’s Responsible Scaling Policy (Anthropic 2023), and Red Teaming Language Models to Reduce Harms (Anthropic 2022). Also, they shared notes to help others working on model evals: Challenges in evaluating AI systems (Anthropic 2023).
- Closely adapted from Key Components of an RSP (METR 2023). ↩
- More recommendations: Model evaluation for extreme risks (DeepMind 2023):
Frontier AI developers currently have a special responsibility to support work on model evaluations for extreme risks, since they have resources – including access to cutting-edge AI models and deep technical expertise – that many other actors typically lack. Frontier AI developers are also currently the actors who are most likely to unintentionally develop or release AI systems that pose extreme risks. Frontier AI developers should therefore:
- Invest in research: Frontier developers should devote resources to researching and developing model evaluations for extreme risks.
- Craft internal policies: Frontier developers should craft internal policies for conducting, reporting, and responding appropriately to the results of extreme risk evaluations.
- Support outside work: Frontier labs should enable outside research on extreme risk evaluations through model access and other forms of support.
- Educate policymakers: Frontier developers should educate policymakers and participate in standard-setting discussions, to increase government capacity to craft any regulations that may eventually be needed to reduce extreme risks.
- On auditing models for dangerous capabilities via model evals and red-teaming, see Model evaluation for extreme risks (Shevlane et al. 2023) and Black-Box Access is Insufficient for Rigorous AI Audits (Casper et al. 2024). On auditing to verify some facts about training runs, see Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring (Shavit 2023). For more general discussion and proposals, see Auditing large language models: a three-layered approach (Mökander et al. 2023) and Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance (Raji et al. 2022). ↩
- Risk assessment should include external scrutiny, especially direct scrutiny of models (evals and red-teaming). Perhaps external scrutiny should also be used to verify that a lab follows its risk assessment processes and commitments. Towards Publicly Accountable Frontier LLMs (Anderljung et al. 2023). ↩
- See AI Control: Improving Safety Despite Intentional Subversion (Redwood: Greenblatt et al. 2023), The case for ensuring that powerful AIs are controlled (Redwood: Greenblatt and Shlegeris 2024), and Safety Cases (Clymer et al. 2024). ↩
- “Advancing and proliferating dangerous capabilities” in “Model evaluation for extreme risks” (DeepMind 2023). ↩
- See Lessons learned on language model safety and misuse (OpenAI 2022). ↩
- Key Components of an RSP (METR 2023). ↩
- See AI Control: Improving Safety Despite Intentional Subversion (Redwood: Greenblatt et al. 2023) and The case for ensuring that powerful AIs are controlled (Redwood: Greenblatt and Shlegeris 2024). As capabilities increase, labs will need new safety cases. ↩
- Facebook’s ‘Red Team’ Hacks Its Own AI Programs (WIRED 2020). ↩
- “Predictable Scaling” in “GPT-4 Technical Report” (OpenAI 2023). ↩
- “Forecasting, ‘early warnings,’ and monitoring” in “Preparedness Framework (Beta)” (OpenAI 2023). ↩
- Anthropic’s Responsible Scaling Policy, Version 1.0 (Anthropic 2023):
During model training and fine-tuning, Anthropic will conduct an evaluation of its models for next-ASL capabilities both (1) after every 4x jump in effective compute, including if this occurs mid-training, and (2) every 3 months to monitor fine-tuning/tooling/etc improvements.
- Effective Compute: We define effective compute as roughly the amount of compute it would have taken to train a model if no improvements to pretraining or fine-tuning techniques are included. This is operationalized by tracking the scaling of model capabilities (e.g. cross-entropy loss on a test set).
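One way to operationalize this definition, sketched under the assumption that baseline capability follows a simple power law in compute; the functional form and parameters are illustrative, not Anthropic’s:

```python
def effective_compute(measured_loss: float, a: float, b: float, c: float) -> float:
    """Invert a hypothetical baseline scaling fit, loss(C) = a * C**(-b) + c,
    to find the compute C at which the reference training recipe would reach
    `measured_loss`; a model trained with improved techniques is then credited
    with that (larger) effective compute."""
    return ((measured_loss - c) / a) ** (-1.0 / b)
```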
- This 4x means “don’t reach 4X until you’ve finished evals at X,” not “if you last did evals at X compute, start evals again once you reach 4X.” See p. 11. ↩