Many AI safety folks think that METR is close to the labs, with ongoing relationships that grant it access to models before they are deployed. This is incorrect. METR (then called ARC Evals) did pre-deployment evaluation for GPT-4 and Claude 2 in the first half of 2023, but it seems to have had no special access since then.1 Other model evaluators also seem to have little access before deployment.
Clarification: there are many kinds of audits. This post is about model evals for dangerous capabilities. But I’m not aware of the labs using other kinds of audits to prevent extreme risks, excluding normal security/compliance audits.
Frontier AI labs’ pre-deployment risk assessment should involve external model evals for dangerous capabilities.2 External evals can improve a lab’s risk assessment and—if the evaluator can publish its results—provide public accountability.
The evaluator should get deeper access than users will get.
- To evaluate threats from a particular deployment protocol, the evaluator should get somewhat deeper access than users will — then the evaluator’s failure to elicit dangerous capabilities is stronger evidence that users won’t be able to either.3 For example, the lab could share a version of the model without safety filters or harmlessness training, and ideally allow evaluators to fine-tune the model.
- To evaluate threats from model weights being stolen or released, the evaluator needs deep access, since someone with the weights has full access.
The costs of using external evaluators are unclear.
- Anthropic said that collaborating with METR “requir[ed] significant science and engineering support on our end”; it has not clarified why. And even if providing deep model access or high-touch support is a hard engineering problem, I don’t understand how sharing API access—including what users will receive and a no-harmlessness no-filters version—could be.
- Sharing model access pre-deployment increases the risk of leaks, including of information about products (modalities, release dates), information about capabilities, and demonstrations of models misbehaving.
Independent organizations that do model evals for dangerous capabilities include METR, the UK AI Safety Institute (UK AISI), and Apollo. Based on public information, there’s only one recent instance of a lab giving access to an evaluator pre-deployment—Google DeepMind sharing with UK AISI—and that sharing was not very deep (see below).
What the labs say they’re doing on external evals before deployment:
- DeepMind4
  - It shared Gemini 1.0 Ultra and Gemini 1.5 Pro with unspecified external groups, apparently including UK AISI, to test for dangerous capabilities before deployment. But it didn’t share deep access: it only shared a system with safety fine-tuning (and for 1.0 Ultra, safety filters) and it didn’t allow evaluators to fine-tune the model. It shared high-level results from 1.5 Pro testing.
  - Its Frontier Safety Framework says “We will . . . explore how to appropriately involve independent third parties in our risk assessment and mitigation processes.”
- Anthropic
- OpenAI
  - Currently nothing
  - Its Preparedness Framework does not mention external evals before deployment. The closest thing it says is “Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties.”
  - It shared GPT-4 with METR in the first half of 2023.
  - It said “We think it’s important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year.” That was in February 2023; I do not believe it elaborated (except to mention that it shared GPT-4 with METR).
- All notable American labs joined the White House voluntary commitments, which include “external red-teaming . . . in areas including misuse, societal risks, and national security concerns, such as bio, cyber, [autonomous replication,] and other safety areas.” External red-teaming does not substitute for external model evals; see below.
  - DeepMind said it did lots of external red-teaming for Gemini.
  - Anthropic said it did external red-teaming for CBRN capabilities. It has also written about using external experts to assess bio capabilities.
  - OpenAI said it did lots of external red-teaming for GPT-4. It has also written about using external experts to assess bio capabilities.
  - Meta said it did external red-teaming for CBRNE capabilities.
  - Microsoft said it’s “building out external red-teaming capacity . . . . The topics covered by such red team testing will include testing of dangerous capabilities, including related to biosecurity and cybersecurity.”
Related miscellanea:
External red-teaming is not external model evaluation. External red-teaming generally involves sharing the model with several people with expertise relevant to a dangerous capability (e.g. bioengineering) who open-endedly try to elicit dangerous model behavior for ~10 hours each. External model evals involve sharing the model with a team of experts at eliciting capabilities, who run somewhat automated and standardized eval suites that they’ve spent ~10,000 hours developing.
Labs’ commitments to share pre-deployment access with UK AISI are unclear.5
This post is about sharing model access before deployment for risk assessment. Labs should also share deeper access with safety researchers (during deployment). For example, some safety researchers would really benefit from being able to fine-tune GPT-4, Claude 3 Opus, or Gemini, and my impression is that the labs could easily give safety researchers fine-tuning access. More speculatively, interpretability researchers could send a lab code and the lab could run it on private models and send the results to the researchers, achieving some benefits of releasing weights with much less downside.6
Everything in this post applies to external deployment. It will also be important to do some evals during training and before internal deployment, since lots of risk might come from weights being stolen or the lab using AIs internally to do AI development.
Labs could be bound by external evals, such that they won’t deploy a model until a particular eval says it’s safe. This seems unlikely to happen (for actually meaningful evals) except by regulation. (I don’t believe any existing evals would be great to force onto the labs, but if governments were interested, evals organizations could focus on creating such evals.)
Thanks to Buck Shlegeris, Eli Lifland, Gabriel Mukobi, and an anonymous human for suggestions. They don’t necessarily endorse this post.
1. METR’s homepage says:
   “We have previously worked with Anthropic, OpenAI, and other companies to pilot some informal pre-deployment evaluation procedures. These companies have also given us some kinds of non-public access and provided compute credits to support evaluation research.
   “We think it’s important for there to be third-party evaluators with formal arrangements and access commitments - both for evaluating new frontier models before they are scaled up or deployed, and for conducting research to improve evaluations.
   “We do not yet have such arrangements, but we are excited about taking more steps in this direction.”
2. GovAI: Schuett et al. 2023. See also DSIT 2023, Brundage et al. 2020, AI Safety Summit 2023, and Anthropic 2024.
3. Idea: when sharing a model for external evals or red-teaming, for each mitigation (e.g. harmlessness fine-tuning or filters), either disable it or make it an explicit part of the safety case for the model. Either claim “users can’t effectively jailbreak the model given the deployment protocol” or disable. Otherwise the lab is just stopping the bioengineering red-teamers from eliciting capabilities with mitigations that won’t work against sophisticated malicious users.
4. A previous version of this post omitted discussion of external testing of Gemini 1.5 Pro. Thanks to Mary Phuong for pointing out this error.
5. Politico and UK government press releases report that AI labs committed to share pre-deployment access with UK AISI. I suspect they are mistaken and these claims trace back to the UK AI safety summit “safety testing” session, which is devoid of specific commitments. I am confused about why the labs have not clarified their commitments and practices.
6. See Shevlane 2022. See also Bucknall and Trager 2023 and Casper et al. 2024.