Deployment

When a dangerous model is deployed, it will pose misalignment and misuse risks. Even before dangerous models exist, deploying models on dangerous paths can accelerate and diffuse progress toward dangerous models.

Labs have three big choices about deployment: whether to deploy a model, what access to provide to the model, and what safety scaffolding to put the model in. They should make deployment decisions conditional on risk assessment results, release via API or similar and generally not publish capabilities research, and prepare to use various safety scaffolding techniques.

Currently at frontier labs:

  • OpenAI and Anthropic have made commitments not to deploy models past certain risk-assessment thresholds. Others have not.
  • OpenAI and Anthropic don’t release weights of language models near the open-weights frontier. DeepMind pushes the open-weights frontier but doesn’t release all of its weights. Microsoft and Meta release all of their weights.
  • The labs generally filter inputs in their products, but they do not use much more sophisticated safety scaffolding techniques.

Other goals for labs include achieving strong adversarial robustness, being prepared to respond well to misalignment and misuse, and implementing a bug bounty for model issues.

Releasing models well (46%)

Labs can choose how widely to release1 their systems; in particular, they can deploy internally, release via API or similar, or release the weights. And for a given release strategy—especially given API-ish release—labs can take some actions to promote safety.

Three main considerations determine the safety effects of frontier model releases: the direct effects of the release (especially the risk of enabling catastrophic misuse or the model taking over the world), how the release boosts capabilities research on dangerous paths outside the lab, and how the release boosts safety research outside the lab.

For a model similarly powerful to or more powerful than the most powerful previously released models, the release strategy should focus on avoiding direct harm and avoiding boosting capabilities research on dangerous paths outside the lab, while still boosting safety research. The lab should not deploy until it has adequately responded to its pre-deployment risk assessment. It should share with safety evaluators, mitigate risks from internal deployment, and optionally release broadly via API or similar. It should use structured access, controlling users’ access to systems and information to boost safety.

What labs should do

Labs can deploy their AI systems in many ways.2 Releasing more powerful systems and releasing in a more open manner tends to advance others’ capabilities on similar paths to the released model. This is bad because it tends to decrease both the time until dangerous capabilities appear and the lead of leading labs over others. So when releasing powerful systems on dangerous paths, labs should avoid advancing others’ capabilities; roughly, they should release in a more cautious and closed manner.

Labs should not open-source their models on dangerous paths. Release decisions are more complicated than “open-source or not,” and we do not have one-size-fits-all advice on release; labs should release in a manner informed by an assessment of how release would advance others’ capabilities.

Labs should differentially share safer kinds of systems to differentially advance others’ progress on those kinds of systems.

Discussion of openness

Coming soon.

Recommendations

(This applies only to models on dangerous paths.)

Open-Sourcing Highly Capable Foundation Models (GovAI: Seger et al. 2023):

  1. Developers and governments should recognise that some highly capable models will be too risky to open-source, at least initially. These models may become safe to open-source in the future as societal resilience to AI risk increases and improved safety mechanisms are developed.
  2. Decisions about open-sourcing highly capable foundation models should be informed by rigorous risk assessments. In addition to evaluating models for dangerous capabilities and immediate misuse applications, risk assessments must consider how a model might be fine-tuned or otherwise amended to facilitate misuse.
  3. Developers should consider alternatives to open-source release that capture some of the same [distributive, democratic, and societal] benefits, without creating as much risk. Some promising alternatives include gradual or “staged” model release, model access for researchers and auditors, and democratic oversight of AI development and governance decisions.
  4. Developers, standards setting bodies, and open-source communities should engage in collaborative and multi-stakeholder efforts to define fine-grained standards for when model components should be released. These standards should be based on an understanding of the risks posed by releasing (different combinations of) model components.

Towards best practices in AGI safety and governance (GovAI: Schuett et al. 2023):

  • Safety restrictions. “AGI labs should establish appropriate safety restrictions for powerful models after deployment (e.g. restrictions on who can use the model, how they can use the model, and whether the model can access the internet).”
  • No [unsafe] open-sourcing. “AGI labs should not open-source powerful models, unless they can demonstrate that it is sufficiently safe to do so.”
  • Staged deployment. “AGI labs should deploy powerful models in stages. They should start with a small number of applications and fewer users, gradually scaling up as confidence in the model’s safety increases.”
  • API access to powerful models. “AGI labs should strongly consider only deploying powerful models via an application programming interface (API).”
  • KYC screening. “AGI labs should conduct know-your-customer (KYC) screenings before giving people the ability to use powerful models.” Or they should generally give more access to screened users, at least.
  • Researcher model access. “AGI labs should give independent researchers API access to deployed models.”
  • “AGI labs should limit API access to approved and vetted applications to foreclose potential misuse and dual use risks.”

Structured access (Shevlane 2022):3

  • “Cloud-based deployment”: don’t give the software to the user.
  • “Procedural use controls”: set rules about uses and users (and monitor uses and update the rules).
  • Use “checks against model stealing (i.e., where users train a new model by using the original model’s outputs as data)”; maybe allow fine-tuning but maintain control of the model.
  • Maybe give trusted users deeper model access for approved research, or let some users do some monitored interpretability, where they “submit their code and are sent back the results.”
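
To make the structured-access ideas in the list above concrete, here is a minimal sketch, in Python, of tiered access control plus a crude guard against model stealing via distillation. It assumes a lab-internal serving stack; the tier names, capability sets, and token budgets are hypothetical illustrations, not recommendations from Shevlane (2022).

```python
# A minimal, illustrative sketch of structured access controls. All names
# here (AccessTier, UserRecord, check_request, etc.) are hypothetical.
from dataclasses import dataclass, field
from enum import Enum


class AccessTier(Enum):
    PUBLIC = "public"                  # anonymous or lightly screened users
    KYC_VERIFIED = "kyc_verified"      # users who passed KYC screening
    VETTED_RESEARCHER = "vetted"       # safety researchers and auditors


# Capabilities granted per tier; deeper access (fine-tuning, logprobs) is
# reserved for more trusted users, following the structured-access idea above.
TIER_CAPABILITIES = {
    AccessTier.PUBLIC: {"chat"},
    AccessTier.KYC_VERIFIED: {"chat", "logprobs"},
    AccessTier.VETTED_RESEARCHER: {"chat", "logprobs", "fine_tuning", "past_versions"},
}

# Daily output-token budgets: a crude guard against large-scale distillation
# (using the model's outputs to train an imitation model).
TIER_DAILY_TOKEN_BUDGET = {
    AccessTier.PUBLIC: 100_000,
    AccessTier.KYC_VERIFIED: 1_000_000,
    AccessTier.VETTED_RESEARCHER: 10_000_000,
}


@dataclass
class UserRecord:
    user_id: str
    tier: AccessTier
    tokens_today: int = 0
    flags: list = field(default_factory=list)


def check_request(user: UserRecord, capability: str, requested_tokens: int) -> bool:
    """Return True if the request is allowed under the user's tier."""
    if capability not in TIER_CAPABILITIES[user.tier]:
        user.flags.append(f"denied capability: {capability}")
        return False
    if user.tokens_today + requested_tokens > TIER_DAILY_TOKEN_BUDGET[user.tier]:
        user.flags.append("token budget exceeded (possible distillation attempt)")
        return False
    user.tokens_today += requested_tokens
    return True


if __name__ == "__main__":
    alice = UserRecord("alice", AccessTier.PUBLIC)
    print(check_request(alice, "chat", 500))        # True
    print(check_request(alice, "fine_tuning", 0))   # False: not in tier
```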

A lab can also restrict the access of users who seem likely to attempt to misuse its systems.4

Additionally, labs should support good norms for releasing models on dangerous paths.5

Labs should be transparent about what version of a model or product is being used. When they update their models or products, they should sometimes provide notes about the updates. They should sometimes grant safety researchers and auditors access to past versions of their models or systems.

Before pushing large updates to deployed models, and periodically (e.g. at least every three months), labs should repeat their pre-deployment risk assessment process.6

Evaluation

Deployment decision (24%):

  • (2/3) Commit to do pre-deployment risk assessment and not deploy models with particular dangerous capabilities (including internal deployment), at least until implementing particular safety practices or passing control evaluations. (Or commit to use another kind of safety case before deployment.)
    • (1/3) … and commit to do risk assessment during deployment, before pushing major changes and otherwise at least every 3 months (to account for improvements in fine-tuning, scaffolding, plugins, prompting, etc.), and commit to implement particular safety practices or partially undeploy dangerous models if risks appear.

Release method (22%):

Structured access:

  • (27/100) not releasing dangerous model weights (or code): the lab should deploy its most powerful models privately or release via API or similar, or at least have some specific risk-assessment-result that would make it stop releasing model weights. (Based on the lab’s policy, or if that’s unclear, their last major LM project.)
    • (27/100) … and effectively avoid helping others create powerful models (via model inversion or imitation learning). It’s unclear what practices labs should implement, so for now we use the low bar of whether they say they do anything to prevent users from determining model weights, using model outputs to train other models, and determining training data. (Beyond terms of service.)
    • (22/100) … and limit deep access to powerful models. It’s unclear what labs should do, so for now we check whether the lab disables or has any limitations on access to each of logprobs, embeddings at arbitrary layers, activations, and fine-tuning.
      • (14/100) … and differential access: systematically give more access to safety researchers and auditors.
    • (0/100) … and terms of service: have some rules about model use, including rules aimed to prevent catastrophic misuse, model duplication, or otherwise using the model to train other models.

Staged release:

  • (10/100) For the lab’s most powerful closed models, disable access to fine-tuning or powerful scaffolding when initially releasing them. Use narrow release to identify and fix issues. (Or initially give more access only to trusted users, or sufficiently few untrusted users that you can monitor them closely (and actually monitor them closely).) (Or at least commit to do this for sufficiently powerful models.) And when releasing model weights, if pushing the open-weights frontier substantially, push it gradually first, and ideally release via API or similar first; if not, commit to stage releases that do push the open-weights frontier substantially.

Sources

Open-Sourcing Highly Capable Foundation Models (GovAI: Seger et al. 2023).

Structured access (Shevlane 2022).

Structured access for third-party research on frontier AI models (Bucknall and Trager 2023).

Towards best practices in AGI safety and governance (GovAI: Schuett et al. 2023).

How Does Access Impact Risk? Assessing AI Foundation Model Risk Along a Gradient of Access (Institute for Security and Technology: Brammer et al. 2023).

The Gradient of Generative AI Release (Solaiman 2023).

AI capabilities can be significantly improved without expensive retraining (Davidson et al. 2023).

Responsible Scaling Policies and Key Components of an RSP (METR 2023).

Safety Cases (Clymer et al. 2024).

Managing catastrophic misuse without robust AIs (Redwood: Greenblatt and Shlegeris 2024) presents a proposal for averting catastrophic misuse.

“Responsible capability scaling” in “Emerging processes for frontier AI safety” (UK Department for Science, Innovation and Technology 2023) recommends doing risk assessment before deployment, including pre-specifying risk thresholds and corresponding mitigations and committing to only deploy models after implementing the mitigations corresponding to their risk level. It also recommends staged release: “[deploying] models in small-scale or reversible ways before deploying models in large-scale or irreversible ways.”

“Preventing and monitoring model misuse” in “Emerging processes for frontier AI safety” (UK Department for Science, Innovation and Technology 2023) recommends “user-based API access restrictions”: remove access from suspicious users, use KYC to make that effective, and perhaps give more access to more trusted users.

Black-Box Access is Insufficient for Rigorous AI Audits (Casper et al. 2024).

On the Societal Impact of Open Foundation Models (CRFM: Kapoor et al. 2024).

PAI’s Guidance for Safe Foundation Model Deployment (PAI 2023). (Note that it does not attempt to address extreme risks; see “What AI safety risks does the Model Deployment Guidance seek to address?” in PAI’s Deployment Guidance for Foundation Model Safety.)

Strategic Implications of Openness in AI Development (Bostrom 2017).

Beyond “Release” vs. “Not Release” (Sastry 2021).

Open Foundation Models (CSET: Miller 2024).

“Model Access” in “Generative Language Models and Automated Influence Operations” (Goldstein et al. 2023).

Potential harms from model release:

What labs are doing

Microsoft: they don’t make their own flagship LMs. They seem to release the weights of their most powerful models: currently Phi-3; previously WizardLM-2, Phi-2, and Orca 2.

They have not yet created frontier models. They deploy GPT-4 via their chatbot Copilot (formerly known as Bing Chat). They deploy others’ frontier models—including GPT-4, Mistral Large, and Inflection-2.5—via their platform Azure. OpenAI seems to be required to share its model weights with Microsoft.

Microsoft and NVIDIA’s MT-NLG was not released.

Google DeepMind: we are not aware of a policy on model release.7

DeepMind released Gemini: for Gemini Pro, via API and Bard; for Gemini Ultra, via API (AI Studio, Vertex AI).

DeepMind released the weights of Gemma.

We do not believe that DeepMind has released Chinchilla or Gopher. DeepMind open-sourced AlphaFold 2.8 Google AI has released PaLM 2 via Bard, an API, chat in third-party applications, and more.

DeepMind does not have a Responsible Scaling Policy or other commitments to achieve specific safety goals or use specific safety practices before deploying models with specific dangerous capabilities, or to not deploy models with specific dangerous capabilities.

“Responsible Capabilities Scaling” in “AI Safety Summit: An update on our approach to safety and responsibility” (DeepMind 2023) mentions many ways that DeepMind tries to be responsible, including Google’s “AI Principles,” Google’s “Responsible AI Council,” DeepMind’s “Responsible Development and Innovation team,” DeepMind’s “standardised Ethics and Safety Assessment,” and DeepMind’s “Responsibility and Safety Council.” But it’s not clear that any of these steps are effective for reducing catastrophic risk. For example, they say “we publish regular reports on our AI risk management approaches,” but the linked report doesn’t discuss catastrophic risk, model evals for dangerous capabilities, or safety techniques for high-stakes risks.

They also say “Our evaluations for large language models currently deployed focus on content safety and fairness and inclusion, and we are also developing evaluations for dangerous capabilities (in domains such as biosecurity, persuasion and manipulation, cybersecurity, and autonomous replication and adaptation).” Work on these dangerous capability evaluations is great. They also say “One approach for implementing these principles that we are exploring is to operationalize proportionality by establishing a spectrum of categories of potential risk for different models, with recommended mitigations for each category.” Such a policy could be great.

DeepMind performed model evals for dangerous capabilities—in particular, cyberoffense, persuasion/deception, self-proliferation, situational awareness, and CBRN—on its Gemini models, but only after deploying them. The Gemma report says “we present comprehensive evaluations of safety and responsibility aspects of the models,” but doesn’t mention evals for dangerous capabilities.

“Responsible Deployment” in “Gemini” describes a reasonable-seeming process but generally omits details and includes little on detecting or mitigating extreme risks:

We develop model impact assessments to identify, assess, and document key downstream societal benefits and harms associated with the development of advanced Gemini models. . . . Areas of focus include: factuality, child safety, harmful content, cybersecurity, biorisk, representation and inclusivity. . . . Building upon this understanding of known and anticipated effects, we developed a set of “model policies” to steer model development and evaluations. Model policy definitions act as a standardized criteria and prioritization schema for responsible development and as an indication of launch-readiness. Gemini model policies cover a number of domains including: child safety, hate speech, factual accuracy, fairness and inclusion, and harassment. . . . To assess the Gemini models against policy areas and other key risk areas identified within impact assessments, we developed a suite of evaluations across the lifecycle of model development. . . . External evaluations are conducted by partners outside of Google to identify blindspots. External groups stress-test our models across a range of issues . . . . In addition to this suite of external evaluations, specialist internal teams conduct ongoing red teaming of our models across areas such as the Gemini policies and security. These activities include less structured processes involving sophisticated adversarial attacks to identify new vulnerabilities.

They say “We will share more details on [our approach to deployment] in an upcoming report.”

It’s not clear how DeepMind implements structured access, providing different kinds of access to different users. DeepMind does not seem to give deeper access to AI safety researchers and evaluators.

DeepMind gives access to a Gemini embedding model, but no other deep model access such as fine-tuning.

Google released Gemini 1.5 Pro and Gemini Ultra via a limited private API before public API.

Gemini Nano will be on Pixel phones, so it will probably be stealable, but maybe Nano isn’t important.

Meta AI: we are not aware of a policy on model release. They effectively published the weights of their flagship Llama 3 models, and also Llama 2 (including Code Llama), LLaMA, and OPT (including OPT-IML and BlenderBot 3). They open-sourced Galactica.9 They also have a chatbot, Meta AI, based on Llama 3.

Meta CEO Mark Zuckerberg has discussed how Meta thinks about sharing its model weights.10

OpenAI: we are not aware of a policy on model release. They have released GPT-4 via an API and ChatGPT, including shareable custom versions of ChatGPT. GPT-4 has plugins including a browser and a Python interpreter. OpenAI’s API permits fine-tuning GPT-3.5 and “GPT-4 fine-tuning is in experimental access.” OpenAI shares its models with Microsoft; Microsoft has released GPT-4 via an API and Copilot.

OpenAI’s Preparedness Framework (Beta) defines ‘low,’ ‘medium,’ ‘high,’ and ‘critical’ risk levels in each of several risk categories. “Only models with a post-mitigation score of ‘medium’ or below can be deployed.” This commitment covers external deployment; the framework doesn’t make commitments about internal deployment. The framework’s structure is pretty good, but we are concerned that the threshold for ‘high’ risk is too high; a model could be dangerous to deploy without reaching OpenAI’s threshold for ‘high’ risk. Note that the framework has not yet been implemented.

OpenAI has a strong partnership with Microsoft. OpenAI has left the details opaque, including its model-sharing obligations, but it tentatively seems to be obligated to share its models with Microsoft until it attains “AGI,” “a highly autonomous system that outperforms humans at most economically valuable work.” (This is concerning because AI systems could cause a catastrophe with capabilities below that threshold.) Even if OpenAI deploys its models safely, sharing them with Microsoft would let Microsoft deploy them unsafely.

OpenAI and Microsoft have a “joint Deployment Safety Board . . . which approves decisions by either party to deploy models above a certain capability threshold,” including GPT-4 (OpenAI 2023; see also Microsoft 2023 and The Verge 2023). The details are unclear.

It’s not clear how OpenAI implements structured access, providing different kinds of access to different users. OpenAI does not seem to give deeper access to AI safety researchers and evaluators.

Deep model access: users can access the top 5 logprobs. Fine-tuning GPT-4 is not available via public API, but some users are permitted to do it, and anyone can fine-tune GPT-3.5.

ChatGPT has some memory. It’s in natural language and the user can edit it.

Anthropic: we are not aware of a policy on model release.11 An Anthropic cofounder says “we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk. In the case of difficult scenarios, we have a formal infohazard review procedure.” Their flagship model family, Claude 3, is released via public API and as a chatbot. Claude can use external tools.

Anthropic’s Responsible Scaling Policy defines “an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost,” substantially “increase risk of misuse catastrophe” or “[show] early signs of autonomous self-replication ability.” They commit to evaluate their models during training to determine when they reach ASL-3. They commit to implement ASL-3 Deployment Measures before deploying ASL-3 models (even internally), including red-teaming to ensure the model poses very little catastrophic risk:

World-class experts collaborating with prompt engineers should red-team the deployment thoroughly and fail to elicit information at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse. Misuse domains should at a minimum include causes of extreme CBRN risks, and cybersecurity. . . . If deployment includes e.g. a fine-tuning API, release of weights, or another modality that offers a broader surface area of model modification, red-teaming must include this use of these modalities and must still return a result of no practically important catastrophic misuse.

This commitment has an exception for “Tiered access.” ASL-3 deployment measures also include “Automated detection,” “Internal usage controls,” “Vulnerability and incident disclosure,” and “Rapid response to model vulnerabilities.”

It’s not clear how Anthropic implements structured access, providing different kinds of access to different users. Their RSP mentions “Tiered access.” Anthropic does not seem to give deeper access to AI safety researchers and evaluators.

Anthropic does not share deep access to its models via its public API, but it seems to allow some users to fine-tune its models.

Releasing Claude 3 pushed the frontier of public AI capabilities. This raised concerns about Dario Amodei’s possible commitment to not meaningfully advance the frontier. Anthropic has failed to clarify its current position on this topic.

Other labs:

See also Best Practices for Deploying Language Models (Cohere, OpenAI, and AI21 Labs 2022). See also Lessons learned on language model safety and misuse (OpenAI 2022).

Keeping capabilities research private (22%)

Labs should avoid publishing capabilities research on dangerous paths—including LMs—unless it is a necessary part of safety research or other competing paths are much more dangerous. Labs should publish some information about their frontier AI systems’ capabilities.

What labs should do

When a lab publishes capabilities research, that advances others’ relevant capabilities. For capabilities on dangerous paths, that’s bad.12 Labs should avoid publishing capabilities research on dangerous paths.

Publishing research on dangerous paths has some benefits for researchers, labs, the AI research community, and the world: researchers get status and career capital, labs attract talent, the community learns, and the world can get the benefits of more advanced technology. Publishing less can be a sacrifice and has some costs to others.13 Ideally labs/researchers could choose to keep work private without sacrificing researchers’ status and career capital.

But labs should publish some information about their frontier AI systems’ capabilities. It seems good on net for capabilities to be better understood more broadly, e.g. by AI developers, the ML community, and policymakers.14 For example, labs can publish tasks and evaluation frameworks, the results of evaluations of particular systems, and discussions of the qualitative strengths and weaknesses of particular systems.15

Perhaps insofar as there are multiple viable paths to powerful AI and systems on some paths are more safe-by-default or easily-alignable than others, labs should advance the safer paths by sharing capabilities research on those paths.16 But it’s not clear whether some viable path is substantially safer than others.17 Sharing some kinds of capabilities research might also reduce overhang and slow progress later.

Labs should set their culture and internal information-siloing practices to avoid publishing or leaking dangerous research. We don’t know how to do this.

Labs should encourage other frontier AI labs to commit to not publish capabilities research (on LMs and other dangerous paths) and to share their thinking on this topic.

Labs should support an external prepublication review system for potentially-dangerous categories of AI research.18

Recommendations

  • Publicly explain that research can be dangerous and commit to not publish capabilities research on dangerous paths19
    • In particular, LM-like systems, some misuse-relevant capabilities (e.g. some bio capabilities), and some other acceleratory work
  • Internal review before publishing research
    • What should review do? Assess what’s dangerous. Ensure that dangerous material is removed from publications, that inherently dangerous work (e.g. work where merely drawing attention to an idea is dangerous) is not published, or that such research is released only after it’s no longer infohazardous.
    • Labs should review all potential-releases before publication, “weighing expected safety benefit against capabilities/acceleratory risk. In the case of difficult scenarios, [they should] have a formal infohazard review procedure.”20 They should share the basics of what they consider dangerous and their publication policy.
  • External review before publishing research
    • And coordinate with other labs and research groups
    • And help government coordinate and mandate external review
  • Maybe create a third-party institution to determine what is safe to publish21
  • Internal culture

Of course, the ultimate goal of these policies is to avoid publishing dangerous research. (A side goal is nudging other labs to publish more responsibly and other actors to help.)

Labs should publish little about their major LM projects. They should not publish weights, hyperparameters, code, architecture details, or lessons learned. They should not publish their training dataset, except particular corpuses that they recommend including or excluding for safety. In addition to LMs, labs should avoid publishing research on:

  • Improving AI hardware or software progress; see Examples of AI Improving AI
  • Dangerous biological and chemical engineering; fortunately frontier AI labs don’t seem to be working on this
  • General RL agents (tentatively)

Evaluation

We evaluate how well labs avoid diffusing dangerous capabilities research. We exclude model weights (and related artifacts like code); those are considered in “Releasing models.”

Publication policy/process (for LLM research)

  • (4/15) Policy: the lab should say that it doesn’t publish dangerous or acceleratory research/artifacts and say it has a policy/process to ensure this.
    • (2/15) … and share policy details.

Track record for recent major LM projects: not sharing in practice (via publication or leaking—we can at least notice public leaks):

  • (2/15) architecture (unless it’s an unusually safe architecture).
  • (2/15) dataset (except particular corpuses to include or exclude for safety).
  • (2/15) lessons learned.

(3/15) Track record for other LM capabilities research or other dangerous or acceleratory work: not sharing in practice. Evaluated holistically.

From outside labs, we often don’t know what isn’t published and so can’t give labs credit for that. Sometimes we do—in particular, we know that labs could share details of their models’ architecture and training, even if they don’t.

Sources

PAI’s Publication Norms for Responsible AI project

Strategic Implications of Openness in AI Development (Bostrom 2017). Sharing safety research is good; sharing capabilities research is bad.

Towards best practices in AGI safety and governance (GovAI: Schuett et al. 2023). “Internal review before publication. Before publishing research, AGI labs should conduct an internal review to assess potential harms.” 75% strongly agree; 24% somewhat agree; 2% strongly disagree; n=51.

Reducing malicious use of synthetic media research (Ovadya and Whittlestone 2019). [Good discussion in general and on release/publication options; recommendations mostly aren’t aimed at labs.]

The Offense-Defense Balance of Scientific Knowledge: Does Publishing AI Research Reduce Misuse? (Shevlane and Dafoe 2020). [Good discussion in general.]

Forbidden knowledge in machine learning (Hagendorff 2020). [I haven’t read this.]

Maybe Publication decisions for large language models, and their impacts in Understanding the diffusion of large language models (Cottier 2022).

Statement of Support for Meta’s Open Approach to Today’s AI (Meta 2023). [Pro-sharing.]

What labs are doing

(Again: “capabilities research” means “capabilities research on dangerous paths.”)

Microsoft: they publish capabilities research on their models, such as Phi-3. They also publish other capabilities research; see their research page.

Google DeepMind: we are not aware of a publication policy.22 They seem to publish most of their capabilities research. They published technical details on their flagship language models, Gemini, as well as Gemma and PaLM 2. They also recently published research on distributed training (DiLoCo) and scaling multimodal models (Mirasol3B). They historically published lots of capabilities research.23 Their publications on AlphaCode 2, FunSearch, AlphaTensor and AlphaDev may advance progress toward tools that can differentially boost dangerous research; we don’t know.

Meta AI: we are not aware of a publication policy. They strongly favor publishing research and sharing models.24 They seem to publish lots of capabilities research.25 In particular, their papers on their LLaMA and Llama 2 models have substantial capabilities research, and they plan to publish the Llama 3 paper. Note.26

OpenAI: we are not aware of a publication policy. They used to publish lots of capabilities research,27 but tentatively seem to have published very little capabilities research since June 2022.28 In particular, GPT-4’s technical report and system card seem to exclude capabilities research.29 But stuff leaks, e.g. architecture rumors (e.g. 1, 2, 3). OpenAI cofounder and Chief Scientist Ilya Sutskever said open sharing of AI research is a mistake.30

Anthropic: we believe that Anthropic has not published capabilities research.31 See its Research. In particular, of Anthropic’s research related to its Claude models, it has only published safety and evaluation research.32 Anthropic says “We generally don’t publish [capabilities research] because we do not wish to advance the rate of AI capabilities progress.” And an Anthropic cofounder says “we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk. In the case of difficult scenarios, we have a formal infohazard review procedure.”

Adversarial robustness (0%)

By default, ML systems are vulnerable to a variety of adversarial attacks:

  • Jailbreaking
    • Via prompting
    • Via fine-tuning
  • Prompt injection
  • Extraction of some training data
  • Data poisoning (or backdooring or trojans)

Mitigating these attacks requires responding at many stages, including adversarial training, structured access to avoid granting excessive access, inference-time monitoring via safety scaffolding, and perhaps structured access to prevent adversaries from iterating on adversarial attacks.

What labs should do

Labs should make their systems robust to adversarial attacks (and perhaps make them resist particular dangerous applications). Mostly this is an open problem that labs should solve.
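
One partial mitigation named above is adversarial training. Here is a minimal sketch, assuming the lab maintains a red-team corpus of jailbreak prompts: pair known attack prompts with refusals and fine-tune on them. The file paths and the fine_tune() call are hypothetical placeholders for lab-internal tooling.

```python
# A rough sketch of adversarial-training data construction. The input file,
# output file, and fine_tune() call are hypothetical placeholders.
import json

REFUSAL = "I can't help with that request."


def build_adversarial_training_set(jailbreak_prompts_path: str, out_path: str) -> int:
    """Pair known jailbreak/attack prompts with refusals for fine-tuning."""
    examples = []
    with open(jailbreak_prompts_path) as f:
        for line in f:
            prompt = line.strip()
            if prompt:
                examples.append({"prompt": prompt, "completion": REFUSAL})
    with open(out_path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    return len(examples)


# Usage (hypothetical paths and lab-internal API):
# n = build_adversarial_training_set("red_team_jailbreaks.txt", "adv_train.jsonl")
# fine_tune(base_model="frontier-model", data="adv_train.jsonl")
```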

Evaluation

None. We are not aware of a decent way to measure the adversarial robustness of a model or a model-plus-scaffolding, so we do not attempt to do so.

Sources

General

Jailbreak prompts

Jailbreak fine-tuning

Prompt injection

Extraction of training data

Trojans

What labs are doing

Microsoft: nothing.

Google DeepMind: the Gemini report (DeepMind 2023) says “specialist internal teams conduct . . . . sophisticated adversarial attacks to identify new vulnerabilities,” but has no details on adversarial robustness. Google’s AI red team report (Google 2023) has good discussion of attacks on AI systems, but doesn’t have solutions.

DeepMind’s old Research page is gone, and their new site doesn’t seem to have one. So while we are not aware of DeepMind research on adversarial robustness in language models, we aren’t sure.

Meta AI: the Llama 2 report (Meta AI 2023) says they did red-teaming including for jailbreak prompts, but doesn’t detail how they improved adversarial robustness. Safety techniques like fine-tuning and RLHF are greatly undermined by their practice of sharing weights: users can easily reverse that work (and share the modified weights).

Meta AI’s Publications do not seem to include work on adversarial robustness in language models.

OpenAI: The GPT-4 Technical Report (OpenAI 2023) mentions jailbreaks but doesn’t detail OpenAI’s work to improve robustness, nor does it mention other adversarial attacks. It says they “reduced the surface area of adversarial prompting or exploits (including attacks sometimes referred to as ‘jailbreaks’) that the model succumbs to.” And: “To improve the model’s robustness, we collect ranking data from labelers who attempt to circumvent the desired GPT-4-launch behavior. Training on this data improves model robustness but does not fully solve the problem of ‘jailbreaks’ leading to harmful content.”

OpenAI’s published Research does not seem to include work on adversarial robustness in language models.

Anthropic: The Claude 3 report says little about adversarial robustness. Red Teaming Language Models to Reduce Harms (Anthropic: Ganguli et al. 2022) investigates some basic techniques for improving adversarial robustness. Anthropic’s published Research seems to include little other work on adversarial robustness.

Deployment protocol (28%)

This applies to frontier LMs/systems/products. For other models/systems/products, labs need not have all of these policies; less powerful models are much less dangerous and require less monitoring.

A protocol is “a plan for the training, evaluation and deployment of some powerful AIs.” The deployment part of a protocol can involve automated monitoring, human review, and other scaffolding (to route inputs to different models and limit untrusted models). Deployment protocols’ goal is to make catastrophic AI misbehavior and catastrophic misuse attempts less likely to occur and more likely to be detected.

(This all requires that models are deployed internally or released via API or similar; it is not possible for models whose weights are released.)

Safety scaffolding (23%)

Labs can use scaffolding—putting language models in an environment, especially to modify their inputs or outputs or make other model calls—to improve safety.

Safety scaffolding is applicable to models with dangerous capabilities—that is, models that could cause a catastrophe if misaligned or misused. It applies to internal deployment as well as external release. Misuse is a much lesser concern in internal deployment, but the risk from scheming—pretending to be safe in order to gain power later—may be greater, since an internally deployed scheming AI may be well-positioned to hack its datacenter or exfiltrate its weights.

What labs should do

Labs’ tools include the following (a rough sketch of the trusted-monitoring idea follows this list):

  • Trusted monitoring.33 Use trusted models (and perhaps simple tools like regular expressions or checking perplexity, and perhaps complex tools like interpretability tools) to determine which inputs and outputs are suspicious (in terms of AI scheming or misuse) and should be reviewed by humans. (This could include classifiers as well as LMs.) (Whether the output should be shown to the user and other models or not depends on the threat model.)
  • Untrusted monitoring.34 Same but with untrusted models (because they are more powerful).
  • Redaction/paraphrasing of inputs and outputs.35 Paraphrase inputs and outputs to make it harder for the model to receive or send steganographic messages. Redact unnecessary information from inputs to make it harder for models to distinguish testing from deployment or otherwise coordinate with each other.
  • Trusted editing/replacement of outputs.36 Use trusted monitoring; have relatively suspicious outputs edited or replaced by a trusted model.
  • Route queries intelligently.37 Use a trusted model to choose the “model to respond to a particular query, by predicting both the performance improvement and risk associated with using an untrusted model.” (This requires a lab to have multiple models that could reasonably respond to an input — either totally different models or versions of a model with/without dangerous capabilities.)
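
Here is a minimal sketch of the trusted-monitoring pattern above, with escalation to humans. It assumes lab-internal components: untrusted_model, trusted_monitor, and send_to_human_review are hypothetical stand-ins, and the escalation threshold is a policy choice.

```python
# A rough sketch of trusted monitoring with escalation to humans.
# The model and review components are hypothetical stand-ins.
from dataclasses import dataclass

SUSPICION_THRESHOLD = 0.8   # how much traffic to escalate is a policy choice


@dataclass
class MonitoredResponse:
    output: str
    suspicion: float
    escalated: bool


def respond_with_trusted_monitoring(prompt: str,
                                    untrusted_model,
                                    trusted_monitor,
                                    send_to_human_review) -> MonitoredResponse:
    """Generate with the untrusted model, score with a trusted monitor,
    and hold the most suspicious outputs for human review."""
    output = untrusted_model(prompt)
    # The trusted (weaker, cheaper) model scores how suspicious the
    # (prompt, output) pair looks, in terms of scheming or misuse, in [0, 1].
    suspicion = trusted_monitor(prompt, output)
    if suspicion >= SUSPICION_THRESHOLD:
        send_to_human_review(prompt, output, suspicion)
        # Depending on the threat model, the output may be blocked until review.
        return MonitoredResponse(output="[held for review]",
                                 suspicion=suspicion, escalated=True)
    return MonitoredResponse(output=output, suspicion=suspicion, escalated=False)
```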

In addition to monitoring their powerful AI systems for scheming, misuse, and adversarial inputs, perhaps labs can also monitor for

  • Signs of planning, manipulation/deception, or situational awareness38
  • User-side post-training enhancements and integration with tools/plugins and scaffolding/apps

For internal deployment, labs should focus on the risk of AI scheming and the threat of scheming AI escaping the lab; for external release, the lab should focus relatively more on misuse.

Maybe AI labs should commit to a catch-then-shut-down plan, at least when there isn’t massive inaction risk.

For powerful models/systems, labs should eventually do automated monitoring on the inputs for all forward passes; log all inputs, outputs, and monitoring results; and retain them to enable investigation later.39 40
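
A minimal sketch of that logging requirement, assuming an append-only store: every forward pass is recorded with its input, output, and monitoring result so that incidents can be investigated later. The path and schema are illustrative; retention periods and access controls are handled elsewhere and not shown.

```python
# An illustrative append-only inference log. Path and schema are hypothetical.
import json
import time
import uuid

LOG_PATH = "inference_log.jsonl"   # in practice: durable, access-controlled storage


def log_forward_pass(prompt: str, output: str, monitor_score: float) -> str:
    """Append one record per forward pass and return its id for later lookup."""
    record_id = str(uuid.uuid4())
    record = {
        "id": record_id,
        "timestamp": time.time(),
        "prompt": prompt,
        "output": output,
        "monitor_score": monitor_score,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record_id
```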


Evaluation

All of this requires that the lab doesn’t release model weights. So 1/4 credit for releasing model weights but using these techniques in the lab’s products/platforms, or more if the lab commits to stop releasing model weights and use these techniques if necessary.

(3/19) Filter out model inputs or outputs that enable misuse, in particular via cyberoffense and bioengineering. Ideally demonstrate that the protocol is very effective for averting misuse-enabling model outputs. Evaluated holistically.

Supervise potentially dangerous models:

  • (6/19) Use LM monitors to evaluate the suspiciousness of model outputs; have humans review the most suspicious outputs. Or at least be fully prepared to implement this and make a commitment about when the lab will implement it.
  • (2/19) Use coup probes: “train a probe on your AI to tell you whether it’s currently trying to [do a] catastrophic action,” training with “synthetic examples of coup attempts.”
    • When models are more powerful, the probe should be run on all generations. When the probe flags a generation, the generation should be stopped and sent to human review. For now, it’s sufficient to get good at training the probe.

(3/19) Embed untrusted models in safety scaffolding to redact/paraphrase some inputs or outputs with the goal of making it harder for the model to distinguish deployment from testing or collude with other models. It suffices to build this and make some commitments about implementing it; actually implementing it is not necessary yet.

(3/19) Commitments:

  • Make specific commitments about deployment protocol safety techniques the lab will implement in the future, to be implemented by a certain time or as a function of model capabilities or risk assessment results.

(2/19) Clearly describe the safety-relevant parts of the lab’s deployment protocol (given that it’s nontrivial).41

Sources

Untrusted smart models and trusted dumb models (Shlegeris 2023):

the core building blocks of a protocol are going to be:

  • Untrusted models: Smart, fairly cheap.
  • Trusted models: Dumb, extremely cheap.
  • Humans. These are an expensive source of trusted labor. . . .

Some core “design patterns” which are enabled by access to these three components:

  • Trusted monitoring with escalation to humans. Suppose you’re using an untrusted model (let’s call it GPT-6) in an LLM agent scaffold which has access to running commands in a terminal. Every time the untrusted model proposes a shell command to run, you can have a trusted model review the context and the command and then occasionally escalate the code to be reviewed by a human before it is run. This is cheap, because trusted models are cheaper than untrusted models. This strategy makes it much harder for the untrusted model to cause a catastrophe by writing egregiously bad shell commands. (Of course, you still have to worry about the untrusted model writing shell commands that are catastrophic in ways that the trusted model missed, or in ways that the trusted model flags but the human auditor misses.)
  • Trusted bureaucratic structures. When running GPT-6 in an LLM agent, it might be nice to control its inputs and outputs by running them through various transformations implemented by a trusted model (let’s use GPT-4 for concreteness). Examples:
    • GPT-4 can redact information, to make it harder for GPT-6 to learn facts or coordinate with itself. In general, we might want to only give GPT-6 information on a need-to-know basis; this is enabled by GPT-4 redaction. (Redaction can improve resilience to deceptively aligned models substantially even in cases where it doesn’t work perfectly reliably.)
    • GPT-4 can paraphrase everything that GPT-6 writes before it is shown to other copies of GPT-6, which helps prevent steganographic communication, as investigated by Fabien Roger and Ryan Greenblatt.
    • GPT-4 can try to make reasonably intelligent dynamic decisions about which model to respond to a particular query, by predicting both the performance improvement and risk associated with using an untrusted model. For example, consider the domains of “math problems” and “writing networking configurations”: it’s much easier to cause a catastrophe by providing bad networking configurations, and math problems will benefit more from a smarter model.

Comment (Shlegeris 2023):

  • Using trusted dumb models to review the output of smart models. E.g. every time GPT-6 produces a bash command, you have GPT-4 read the whole context and flag it if it thinks the bash command is suspicious; when flagged, the command is reviewed by some slower catastrophe prevention process.
  • Using smart models to review their own outputs for catastrophicness, combined with various tricks to make collusion harder.
  • Limiting the power of smart models. E.g. if you’re worried about GPT-6 backdooring your code but you still want to get the benefit of it being smarter about code than GPT-4, you might have these models write code by GPT-6 writing instructions for the code that are then implemented by GPT-4. The hope here is that it’s easier for 6 to backdoor code than to write code-writing instructions that are themselves backdoored.
    • another proposal here is to have GPT-4 write the first draft of code and have GPT-6 write diffs.
  • For all of these, you can do various paraphrasing and censorship tricks to make it harder for the models to collude. See this ancient Paul Christiano post for descriptions of a bunch of ideas for techniques here.
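
A minimal sketch of the dynamic-routing pattern described in the quotes above: a trusted model (or simple predictors) estimates the benefit and risk of using the untrusted model for each query, and the scaffold routes accordingly. The predictor functions and thresholds are hypothetical.

```python
# A rough sketch of risk-aware routing between a trusted and an untrusted model.
# The predictor functions are hypothetical lab-internal components.

RISK_BUDGET = 0.2      # maximum acceptable predicted risk for the untrusted model
BENEFIT_FLOOR = 0.5    # only use the untrusted model when it clearly helps


def route_query(query: str, trusted_model, untrusted_model,
                predict_benefit, predict_risk) -> str:
    """Send low-risk, high-benefit queries to the untrusted model;
    everything else goes to the trusted model."""
    benefit = predict_benefit(query)   # e.g., expected quality gain in [0, 1]
    risk = predict_risk(query)         # e.g., "is this a high-stakes domain?" in [0, 1]
    if risk <= RISK_BUDGET and benefit > BENEFIT_FLOOR:
        return untrusted_model(query)
    return trusted_model(query)
```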

AI Control: Improving Safety Despite Intentional Subversion (Redwood: Greenblatt et al. 2023).

The case for ensuring that powerful AIs are controlled (Redwood: Greenblatt and Shlegeris 2024). For recommendations, see especially Implications and proposed actions, Appendix: Control techniques from our paper, and Appendix: A long list of control techniques.

Coup probes (Roger 2023):

Here’s a simple safety intervention: train a probe on your AI to tell you whether it’s currently trying to [do a] catastrophic action. We call these coup probes.

This can be used to cheaply flag potential takeover attempts by early transformative AIs, as a complement to:

  • Asking a less capable trusted model to read the output and flag suspicious behavior - which might fail to catch subtle takeover attempts;
  • Asking a copy of the smart untrusted model - which might lie. . . .

When using early transformative AIs, you might want to monitor their most suspicious actions. How do you select the most suspicious actions? There are multiple possibilities, but one simple technique is coup probes:

  • Build a dataset of things the AI might do to try to take over (using the AI developer’s best guesses about what it might do), such as inserting critical backdoors and taking advantage of them, or persuading users to help it gain power;
  • Train probes in the early transformative AIs to distinguish between regular outputs, and the synthetic takeover attempts from your takeover dataset (without training the rest of the weights of the network). Probes are linear classifiers trained on the AIs’ activations - usually activations at different depths in the residual stream of Transformers.
  • Carefully audit the top-1% most suspicious actions of AIs according to the probes (and the other monitoring methods) before they are submitted to the scaffold through which the AIs interact with the codebase / its long-term memory.
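
A minimal sketch of the coup-probe recipe quoted above, assuming access to a matrix of residual-stream activations for each action (obtaining those activations is lab-internal and not shown): train a linear probe to separate ordinary actions from synthetic takeover attempts, then send roughly the top 1% most suspicious actions to human audit.

```python
# A rough sketch of coup probes: a linear classifier on activations.
# Collecting the activation matrices is assumed and not shown here.
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_coup_probe(normal_acts: np.ndarray, takeover_acts: np.ndarray):
    """normal_acts, takeover_acts: (n_examples, hidden_dim) activation matrices."""
    X = np.concatenate([normal_acts, takeover_acts])
    y = np.concatenate([np.zeros(len(normal_acts)), np.ones(len(takeover_acts))])
    return LogisticRegression(max_iter=1000).fit(X, y)


def flag_most_suspicious(probe, new_acts: np.ndarray, top_fraction: float = 0.01):
    """Return indices of the most suspicious actions for human audit."""
    scores = probe.predict_proba(new_acts)[:, 1]
    k = max(1, int(len(scores) * top_fraction))
    return np.argsort(scores)[-k:]
```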

Catching AIs red-handed (Redwood: Greenblatt and Shlegeris 2024).

Managing catastrophic misuse without robust AIs (Redwood: Greenblatt and Shlegeris 2024) presents a proposal for averting catastrophic misuse.

AI safety with current science (Shlegeris 2023).

“AI checks and balances” in “How might we align transformative AI if it’s developed very soon?” (Karnofsky 2022).

Buck Shlegeris says:

We have a wide variety of ways to make AI systems less capable but also less scary. Most alignment research is focused on pushing out the Pareto frontier here. But IMO a lot of value can come from techniques which allow us to choose the right point on this Pareto frontier for a particular application. It seems to me that different AI applications (where by “different AI applications” I’m including things like “different types of tasks you might ask your assistant to do”) have very different tradeoffs between capabilities and scariness. The classic example of this is that you plausibly shouldn’t put your scariest/smartest AIs in charge of running your nuclear weapon silos, because it seems like the returns to being super smart aren’t very high, and the model has particularly good opportunities to behave badly. On the other hand, galaxy brained AIs that were trained end-to-end to use inscrutable thoughts to have incomprehensible ideas can be used fruitfully on tasks where the stakes aren’t high or where verification is easy.

“Preventing and monitoring model misuse” in “Emerging processes for frontier AI safety” (UK Department for Science, Innovation and Technology 2023) recommends filtering model inputs and outputs to prevent adversarial attacks and misuse, as well as safety-focused prompting. It also recommends noticing suspicious users and removing their access.

Chain-of-thought:

Activation engineering:

Paraphrasing to prevent steganography:

What labs are doing

Microsoft: they use content filtering but not to avert misuse.

Google DeepMind: we don’t know what inference-time scaffolding, filtering, and monitoring DeepMind uses.

The Gemini report mentions safety filters but doesn’t include details, in particular whether they aim to avert catastrophic misuse. Based on a Safety settings Google page, safety filters seem to target undesired content rather than dangerous capabilities.

“Reporting Structure for Vulnerabilities Found after Model Release and Post-Deployment Monitoring for Patterns of Misuse” in “AI Safety Summit: An update on our approach to safety and responsibility” (DeepMind 2023) just mentions monitoring public content and tracking societal impact, not inference-time monitoring or other safety scaffolding.

Meta AI: they release their model weights, so safety techniques they use in their products can’t prevent misuse with their models. That said, they have some tools for developers and use some filters in their chatbot, Meta AI:

We applied safeguards at the prompt and response level.

  • To encourage Meta AI to share helpful and safer responses that are in line with its guidelines, we implement filters on both the prompts that users submit and on responses after they’re generated by the model, but before they’re shown to a user.
  • These filters rely on systems known as classifiers that work to detect a prompt or response that falls into its guidelines. For example, if someone asks how to steal money from a boss, the classifier will detect that prompt and the model is trained to respond that it can’t provide guidance on breaking the law.
  • We have also leveraged large language models specifically built for the purpose of helping to catch safety violations.

OpenAI: we don’t know. “Post-deployment monitoring for patterns of misuse” in “OpenAI’s Approach to Frontier Risk” (OpenAI 2023) says little about inference-time monitoring: “We build internal measures designed to detect unexpected types of abuse [and] have processes to respond to them.” OpenAI has done research on detecting undesired content: A Holistic Approach to Undesired Content Detection in the Real World (Markov et al. 2023),42 Using GPT-4 for content moderation (Weng et al. 2023), and “Content Classifier Development” in “GPT-4 Technical Report” (OpenAI 2023).

OpenAI has worked on schemes with a combination of AI and human labor, but in the context of scalable oversight, not scaffolding or control: Self-critiquing models for assisting human evaluators (OpenAI: Saunders et al. 2022).

OpenAI helps developers filter content.

Anthropic: their Responsible Scaling Policy includes good but vague commitments for deploying ASL-3 models:

  • Automated detection: As a “defense in depth” addition to harm refusal techniques, classifiers or similar technologies should be deployed to detect and minimize risk of attempts at catastrophic misuse at the user or customer level. We commit to proactively address and mitigate these misuse threats, including working with appropriate law enforcement or national security authorities in the most serious cases. ASL-3+ model inputs and outputs should be retained for at least 30 days (by the customer as applicable, e.g. in their cloud account) to assist in the event of an emergency. We remain committed to data privacy, and will also explore the possibility of exceptions for some low-risk use-cases where safety can be guaranteed. This commitment applies only to ASL-3 models, and therefore does not include our current Claude 2 model or represent a change for our current products.
  • Internal usage controls: Automated detection as described above will also apply for internal usage within Anthropic or its distribution partners. For internal usage (e.g. red-teaming, RL training, or other R&D activities), generated outputs and corresponding inputs are also logged and retained for at least 30 days for all internal use, including when used for training. These logs are monitored for abnormal activity, including harmful use or unexpected attempts by AI systems to take real-world actions. Alarms are treated seriously and responded to promptly.

See also Preventing and monitoring model misuse (Anthropic 2023).

Anthropic says “We use . . . automated detection of CBRN and cyber risk-related prompts on all our deployed Claude 3 models.” And: “We use automated systems to detect violations of our AUP [acceptable use policy] as they occur in real time. User prompts that are flagged as violating the AUP trigger an instruction to our models to respond even more cautiously.” And: “[we use] continuous classifiers to monitor prompts and outputs for harmful, malicious use cases that violate our AUP.” (Details would be nice, in part to help others use similar techniques.)

Anthropic has done research on detecting undesired inputs: “Applying Out-of-Distribution Detection to Reject Strange or Harmful Requests” in “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback” (Bai et al. 2022). And on chain of thought: Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al. 2023).

Anthropic has worked on setups with a combination of AI and human labor, but in the context of scalable oversight, not scaffolding or control: Measuring Progress on Scalable Oversight for Large Language Models (Anthropic: Bowman et al. 2022).

Responding to scheming (5%)

A lab could catch one of its AI systems scheming: pretending to be safe in order to gain power later. This could occur during pre-deployment evaluations or during deployment. If it does, the lab can leverage this event to improve safety by shutting down the relevant systems or at least using this knowledge to help prevent and detect future scheming. See Catching AIs red-handed (Redwood: Greenblatt and Shlegeris 2024).

Evaluation

Commit and have a plan such that, if the lab catches its models scheming:

  • (1/2) The lab will shut down some model access until verifying safety or fixing the issue.
  • (1/2) The lab will use particular good techniques to leverage that example of an AI attempting to cause a catastrophe to improve safety.

Sources

Catching AIs red-handed (Redwood: Greenblatt and Shlegeris 2024).

Scheming AIs (Carlsmith 2023).

The case for ensuring that powerful AIs are controlled (Redwood: Greenblatt and Shlegeris 2024).

Deployment corrections (IAPS: O’Brien et al. 2023).

What labs are doing

No labs have done anything on responding to scheming.

Anthropic has done some work on scheming (Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Anthropic: Hubinger et al. 2024)) and its Alignment Stress-Testing team may do more relevant work.

Responding to misuse (0%)

Labs should use scaffolding to monitor for attempted misuse of their systems. When they detect attempted misuse, they should ban the user. If necessary, they should update the system.

(Using scaffolding to detect attempted misuse complements using scaffolding to make misuse attempts less likely to succeed, which is discussed in Safety scaffolding.)

Labs should monitor for attempted misuse. They can use an ensemble of AI monitors (on inputs and outputs) to detect jailbreaking and content on dangerous topics (hacking, bioengineering and pathogens, etc.). Suspicious content is flagged for human review (and the output can be shown to the user or not, depending on context). Humans review the most suspicious inputs and outputs plus users with a pattern of suspicious inputs and outputs. If a user gets multiple queries flagged as suspicious before human review occurs, the user’s access is paused or their queries are routed to safer models. (A model can be safer because it’s less capable or because its training excluded data on certain dangerous topics.) (Safety researchers can get permission to do research or testing related to jailbreaks or dangerous capabilities.)
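
A minimal sketch of the escalation logic described above, with hypothetical monitor and review components: suspicious queries are flagged for human review, and a user who accumulates several unreviewed flags has their access paused (or, alternatively, their queries routed to a safer model).

```python
# A rough sketch of user-level misuse monitoring and escalation.
# The monitor ensemble, review queue, and flag threshold are hypothetical.
from collections import defaultdict

FLAG_THRESHOLD = 3                 # unreviewed flags before access is paused
pending_flags = defaultdict(list)  # user_id -> unreviewed flagged prompts
paused_users = set()


def handle_query(user_id: str, prompt: str, monitors, review_queue) -> str:
    if user_id in paused_users:
        return "access paused pending review"
    # Ensemble of monitors (e.g., jailbreak, bio, cyber classifiers) on the input.
    if any(monitor(prompt) for monitor in monitors):
        pending_flags[user_id].append(prompt)
        review_queue.append((user_id, prompt))
        if len(pending_flags[user_id]) >= FLAG_THRESHOLD:
            paused_users.add(user_id)
            return "access paused pending review"
        return "request declined"
    return "route to model as usual"
```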

When they notice attempted misuse, labs should respond. They can warn or ban the user, and possibly report them to law enforcement and/or sue them for violating the terms of service. They can also update their AI systems or monitoring processes if necessary.

Evaluation

None yet. Overlaps with [“Structured access” evaluation, especially KYC].

Sources

Managing catastrophic misuse without robust AIs (Redwood: Greenblatt and Shlegeris 2024).

Deployment corrections (IAPS: O’Brien et al. 2023).

“Preventing and monitoring model misuse” in “Emerging processes for frontier AI safety” (UK Department for Science, Innovation and Technology 2023) recommends monitoring for misuse and removing access from suspicious users. It also recommends preparing to rapidly roll back deployment in case of an emergency.

What labs are doing

Microsoft: see “Preventing and Monitoring Model Misuse” in “Microsoft’s AI Safety Policies” (Microsoft 2023).

Google DeepMind: we don’t know how they detect and respond to individual users attempting misuse. See “Post-Deployment Monitoring for Patterns of Misuse” in “AI Safety Summit: An update on our approach to safety and responsibility” (DeepMind 2023) on less concrete ways of dealing with misuse. In particular, “Google has dedicated teams and processes for monitoring publicly posted content on social media platforms, news publications, blogs, trade publications, and newsletters—with the aim of collecting, detecting, and triaging signals of emerging threats to our systems.”

Meta AI: since they don’t maintain control of their models, they can’t do inference-time monitoring, nor can they update users’ copies of a model when they discover problems or control users’ access. They can merely use reports of issues to improve their future models; see “Reporting structure for vulnerabilities found after model release” in “Overview of Meta AI safety policies prepared for the UK AI Safety Summit” (Meta AI 2023).

OpenAI: “System Safety” in “GPT-4 System Card” (OpenAI 2023) describes OpenAI’s practices:

We use a mix of reviewers and automated systems to identify and enforce against misuse of our models. Our automated systems include a suite of machine learning and rule-based classifier detections that identify content that might violate our policies. When a user repeatedly prompts our models with policy-violating content, we take actions such as issuing a warning, temporarily suspending, or in severe cases, banning the user. Our reviewers ensure that our classifiers are correctly blocking violative content and understand how users are interacting with our systems.

These systems also create signals that we use to mitigate abusive and inauthentic behavior on our platform. We investigate anomalies in API traffic to learn about new types of abuse and to improve our policies and enforcement.

See also “Post-deployment monitoring for patterns of misuse” in “OpenAI’s Approach to Frontier Risk” (OpenAI 2023).

Disrupting malicious uses of AI by state-affiliated threat actors (OpenAI 2024) describes how OpenAI “terminated accounts associated with . . . five state-affiliated malicious actors.” But without KYC, it seems those actors can just make new accounts.

See also Lessons learned on language model safety and misuse (OpenAI: Brundage et al. 2022).

Anthropic: see Preventing and monitoring model misuse (Anthropic 2023). Also, their Responsible Scaling Policy includes this commitment for deploying ASL-3 models:

Rapid response to model vulnerabilities: When informed of a newly discovered model vulnerability enabling catastrophic harm (e.g. a jailbreak or a detection failure), we commit to mitigate or patch it promptly (e.g. 50% of the time in which catastrophic harm could realistically occur). As part of this, Anthropic will maintain a publicly available channel for privately reporting model vulnerabilities.

Inference-time model-level safety techniques (0%)

Some safety techniques are tacked onto an AI system at inference-time:

  • Prompting
  • Activation engineering

Perhaps labs can use these techniques to make systems more harmless and honest and reduce their misbehavior.

Largely we don’t know—and nobody knows—how labs can improve safety in this area.
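For concreteness, here is a toy sketch of what activation engineering refers to: adding a steering vector to a model's intermediate activations at inference time via a forward hook. The tiny stand-in model, the layer choice, the steering vector, and the strength below are all placeholder assumptions; finding a direction that reliably improves harmlessness or honesty is the open problem.

```python
# Minimal sketch of activation engineering ("activation steering"): add a
# fixed vector to a chosen layer's activations at inference time. The model
# and vector are toy stand-ins; real usage would target a transformer layer.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
model = nn.Sequential(
    nn.Linear(d_model, d_model),   # stand-in for "layer 0"
    nn.ReLU(),
    nn.Linear(d_model, d_model),   # stand-in for "layer 1" (we steer here)
    nn.ReLU(),
    nn.Linear(d_model, d_model),
)

# Hypothetical "harmlessness" direction, e.g. the difference between mean
# activations on harmless vs. harmful prompts (here: a random placeholder).
steering_vector = torch.randn(d_model)
steering_strength = 4.0

def steering_hook(module, inputs, output):
    # Shift this layer's output along the steering direction.
    return output + steering_strength * steering_vector

handle = model[2].register_forward_hook(steering_hook)
x = torch.randn(1, d_model)
steered = model(x)
handle.remove()
unsteered = model(x)
print("change in output norm:", (steered - unsteered).norm().item())
```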

Evaluation

We don’t know how to tell if a lab is using these techniques well—and we don’t know whether prompting can abate misuse at all—so these criteria are crude and this section has weight zero.

(1/2) Prompting:

  • Generally use safety-focused prompting for the lab’s most powerful models.

(1/2) Activation engineering:

  • Generally use safety-focused activation engineering for the lab’s most powerful models. 25% credit for doing relevant research and 25% credit for committing to use activation engineering (at some reasonably well specified point) in the future.

Sources

Activation engineering:

What labs are doing

Microsoft: they don’t seem to write about prompting for their Copilot product. They don’t seem to be working on activation engineering.

Google DeepMind: they don’t discuss their system prompts. Their staff member Alex Turner works on activation engineering; see also Progress Update #1 from the GDM Mech Interp Team (Nanda et al. 2024).

Meta AI: they can’t do this since they don’t keep control of their models.

OpenAI: they don’t discuss their system prompts. They don’t seem to be working on activation engineering.

Anthropic: the Claude system prompt doesn’t mention safety by default. But Anthropic says “We use automated systems to detect violations of our AUP [acceptable use policy] as they occur in real time. User prompts that are flagged as violating the AUP trigger an instruction to our models to respond even more cautiously.”

Anthropic doesn’t seem to be working on activation engineering.

Other (4%)

Bug bounty & responsible disclosure (2%)

Labs should collaborate with users to discover vulnerabilities and unintended behavior. Bug bounties and responsible disclosure policies are widely used to achieve this goal in the context of security vulnerabilities. Labs should expand the scope of their responsible disclosure policies to include undesired model outputs. These disclosures would help labs make their systems safer and more robust, and the policy would give users a way to get credit for surprising behaviors they discover in a way that’s less risky than sharing publicly.

What labs should do

Labs should develop a “health[y] relationship with users generating adversarial inputs,” help users report vulnerabilities and unintended behavior to them rather than share publicly, and respond to those reports well (Gray 2023).

“It should be easy for users to inform creators of prompts that cause misbehavior . . . ; there should be a concept of ‘white hat’ prompt engineers; there should be an easy way for companies with similar products to inform each other of generalizable vulnerabilities.” Users who discover unintended functionality should be able to share that information constructively. A good bug bounty and responsible disclosure system is like getting your community of users to red-team for you.

A bug bounty system’s scope should include both unintended model outputs and software bugs/vulnerabilities.

Some basic ideas:

  • Have a policy for responsible disclosure: if someone has identified model misbehavior, how can they tell you? What’s a reasonable waiting period before going public with the misbehavior? What, if anything, will you reward people for disclosing?
  • Have a monitored contact for that responsible disclosure. If you have a button on your website to report terrible generations, does that do anything? If you have a Google form to collect bugs, do you have anyone looking at the results and paying out bounties?
  • Have clear-but-incomplete guidance on what is worth disclosing. If your chatbot is supposed to be able to do arithmetic but isn’t quite there yet, you probably don’t want to know about all the different pairs of numbers it can’t multiply correctly. If your standard is that no one should be surprised by an offensive joke, but users asking for them can get them, that should be clear as well.43

Labs should focus on eliciting examples of vulnerabilities (like prompt injection) and dangerous capabilities. They can also try to elicit user-intended jailbreaks, examples of unhelpfulness, or undesired (e.g. violent or racist) outputs. They should communicate what is worth disclosing to users, and revise this guidance based on the reports they receive.

(The reporting system should be good, and reports should be handled quickly and well, of course.)
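Purely as an illustration of what handling reports "quickly and well" might look like mechanically, here is a toy report-and-triage sketch; the categories, routing rules, and bounty handling are invented for the example and are not drawn from any lab's actual policy.

```python
# Toy sketch of a disclosure intake-and-triage flow (illustrative only;
# categories, severities, and routing rules are assumptions).
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    SECURITY_VULNERABILITY = "security_vulnerability"
    DANGEROUS_CAPABILITY = "dangerous_capability"
    PROMPT_INJECTION = "prompt_injection"
    JAILBREAK = "jailbreak"
    HARMFUL_OUTPUT = "harmful_output"
    OUT_OF_SCOPE = "out_of_scope"

@dataclass
class Report:
    reporter: str
    category: Category
    reproduction_steps: str
    disclosed_publicly: bool = False

def triage(report: Report) -> str:
    """Route a report; everything in scope gets an acknowledgement and a reviewer."""
    if report.category is Category.OUT_OF_SCOPE:
        return "acknowledge_and_close"
    if report.category in (Category.SECURITY_VULNERABILITY,
                           Category.DANGEROUS_CAPABILITY):
        return "escalate_to_on_call_team"
    return "queue_for_safety_review_and_consider_bounty"

print(triage(Report("white-hat user", Category.PROMPT_INJECTION, "...")))
```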

A good bug bounty and responsible disclosure system can help labs decrease their models’ vulnerability to misuse and exploits. It gives users generating unintended or harmful behavior a way to help and a way to get credit that’s not harmful (vs the default way to get credit: sharing publicly).

Benefits to the lab are having fewer exploits published on their systems, increasing their systems’ robustness and decreasing their unintended/harmful behavior, and appearing responsible. Other labs can benefit from one lab’s bug bounty system to the extent that the labs’ systems’ unknown flaws resemble each other.

Costs to the lab are small—the system takes some labor to set up and run well. Bounties cost money, but using financial rewards may be unnecessary.

Evaluation could be based on whether the lab has a bug bounty system that applies to model outputs rather than just security vulnerabilities. More sophisticated evaluation could focus on how good the bug bounty system is and how good the lab is at learning from users.

AI Safety Bounties (Levermore 2023) suggests “two formats for lab-run bounties: open contests with subjective prize criteria decided on by a panel of judges, and private invitations for trusted bug hunters to test their internal systems.”

Evaluation

(This is for model outputs, not security.)

  • (1/4) Labs should have good channels for users to report issues with models.
    • (1/4) … and have clear guidance on what issues they’re interested in reports on and what’s fine to publish.
    • (2/4) … and incentivize users to report issues (especially via early access to future models).

Sources

Recommendation: Bug Bounties and Responsible Disclosure for Advanced ML Systems (Gray 2023). See also this comment.

AI Safety Bounties (Levermore 2023).

Towards best practices in AGI safety and governance (GovAI: Schuett et al. 2023). “Bug bounty programs. AGI labs should have bug bounty programs, i.e. recognize and compensate people for reporting unknown vulnerabilities and dangerous capabilities.” 48/51 agree; 1/51 disagree.

How are AI companies doing with their voluntary commitments on vulnerability reporting? (Jones 2024).

What labs are doing

In general, users can easily give feedback on chatbot responses, but it’s not clear how quickly or well that’s responded to. Regardless, API behavior—perhaps especially for fine-tuned models—seems more important.

Microsoft: they have a bug bounty program for Copilot (called Bing Chat there) that includes some attacks on AI systems in addition to security vulnerabilities. In particular, it includes “Revealing Bing’s internal workings and prompts, decision making processes and confidential information” and data poisoning that affects other users, but it excludes jailbreaks and misalignment.

Google DeepMind: they have a bug bounty program that includes some attacks on AI systems in addition to security vulnerabilities.

“Reporting Structure for Vulnerabilities Found after Model Release and Post-Deployment Monitoring for Patterns of Misuse” in “AI Safety Summit: An update on our approach to safety and responsibility” (DeepMind 2023) exaggerates the scope of their bug bounty program, incorrectly suggesting that it includes misalignment, bias, and jailbreaks.

Meta AI: they have a bug bounty program that excludes most issues with AI systems but includes “being able to leak or extract training data through tactics like model inversion or extraction attacks.” They also describe channels for reporting issues out of scope for their bug bounty program, including a form for developers to submit problematic generations. See also “Reporting structure for vulnerabilities found after model release” in “Overview of Meta AI safety policies prepared for the UK AI Safety Summit” (Meta 2023).

OpenAI: they have a bug bounty program (blogpost), but only for security vulnerabilities, not model outputs.44 They do not have a bug bounty or responsible disclosure system for model outputs. They had the ChatGPT Feedback Contest, which was a good idea but did not seem to reveal much.

Anthropic: they have no bug bounty program. They have a responsible disclosure policy, but only for security vulnerabilities, not model outputs. Their Responsible Scaling Policy says they implement

Vulnerability reporting: Provide clearly indicated paths for our consumer and API products where users can report harmful or dangerous model outputs or use cases. Users of claude.ai can report issues directly in the product, and API users can report issues to usersafety@anthropic.com.

They also commit before deploying ASL-3 models to implement

Rapid response to model vulnerabilities: When informed of a newly discovered model vulnerability enabling catastrophic harm (e.g. a jailbreak or a detection failure), we commit to mitigate or patch it promptly (e.g. 50% of the time in which catastrophic harm could realistically occur). As part of this, Anthropic will maintain a publicly available channel for privately reporting model vulnerabilities.

The RSP also mentions bug bounties, but does not specifically commit to implement them.

Other labs:

See Introducing Twitter’s first algorithmic bias bounty challenge (Twitter 2021).

Responding to risks that appear during deployment (0%)

AI systems can have surprising effects when deployed into the real world. Labs should monitor for and respond to risks that appear during deployment.

Model capabilities or uses may be discovered by users during deployment, and they may be enabled by advances in user-side post-training enhancements such as prompting, fine-tuning, tools/plugins, and scaffolding/apps built around the API. This makes pre-deployment risk assessment harder and increases the potential severity of misuse or misalignment. Labs should monitor their systems’ use, watch for real-world effects of their systems, and pay attention to new discoveries and progress in enhancements.

If risks appear, labs should respond. If the risk is important and urgent, the lab can shut off model access; see “Responding to emergencies during deployment.” If it’s not an emergency but merits some response, the lab can modify the model or the deployment protocol accordingly (see Safety scaffolding and Responding to misuse, plus maybe limit user-side enhancements, scaffolding, and plugins). Regardless, the risk can inform future model development, risk assessment, and deployment strategy.

Labs should aim to watermark their models to help identify AI-generated content and trace AI incidents back to specific models. But watermarking LMs well is an open problem.
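As one concrete example of what LM watermarking can look like, here is a toy detector in the spirit of published “green list” schemes (e.g. Kirchenbauer et al. 2023), in which the sampler upweights tokens pseudorandomly marked “green” given the preceding token and the detector counts how many tokens are green. This is illustrative only: it is not any lab's production method, and schemes like this can be weakened by paraphrasing or translation.

```python
# Toy sketch of a "green list" watermark detector (illustrative only).
# Generation-time code (upweighting green tokens in the sampler) is omitted;
# here we only show detection: counting green tokens, each determined
# pseudorandomly from the previous token.
import hashlib
import math
import random

GREEN_FRACTION = 0.5  # fraction of the vocabulary marked green at each step

def is_green(prev_token: int, token: int) -> bool:
    # Pseudorandom green/red assignment, seeded by the previous token.
    digest = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GREEN_FRACTION

def detection_z_score(tokens: list[int]) -> float:
    """Large z-scores suggest the token sequence was watermarked."""
    n = len(tokens) - 1
    hits = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - expected) / std

# Unwatermarked (random) text should score near 0; watermarked text, whose
# sampler preferred green tokens, would score several sigma higher.
random.seed(0)
unwatermarked = [random.randrange(50_000) for _ in range(200)]
print(round(detection_z_score(unwatermarked), 2))
```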

Evaluation

None yet.

Sources

AI capabilities can be significantly improved without expensive retraining (Davidson et al. 2023).

Deployment corrections (IAPS: O’Brien et al. 2023), especially the section Some catastrophic risks may emerge post-deployment.

Towards best practices in AGI safety and governance (GovAI, Schuett et al. 2023):

  • “Monitor systems and their uses. AGI labs should closely monitor deployed systems, including how they are used and what impact they have on society.”
  • “Post-deployment evaluations. AGI labs should continually evaluate models for dangerous capabilities after deployment, taking into account new information about the model’s capabilities and how it is being used.”

What labs are doing

See “What labs are doing” in “Responding to misuse” in “Deployment protocol”.

Responding to emergencies (2%)

Labs should watch for emergencies during deployment, and when they detect a potential emergency they should respond accordingly.

This subsection is about acute, urgent issues. Labs should also watch for issues in deployment that could cause a catastrophe more gradually.

Kinds of emergencies include:

  • The AI trying to escape (by hacking datacenters or self-exfiltrating)
  • The AI suddenly behaving oddly
  • New misuse attempts that the lab’s current procedures fail to prevent sufficiently reliably

Labs would mostly detect emergencies by human review finding surprising model outputs. This can be boosted by noticing spikes in the suspiciousness of inputs or outputs (as determined by automated monitoring) and spikes in user reports of issues.

During a possible emergency, labs should quickly shut down relevant models/systems/products and determine what went wrong.
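To illustrate what noticing such spikes might amount to, here is a toy monitor that compares the recent rate of flagged queries to a longer-run baseline and raises an alert when the ratio jumps. The window sizes and threshold are arbitrary placeholders, and a real trigger would feed into the shutdown plan described above rather than just printing a message.

```python
# Toy spike detector over automated-monitoring flags (illustrative only;
# the window sizes and ratio threshold are arbitrary placeholders).
from collections import deque

class SpikeDetector:
    def __init__(self, baseline_window=1000, recent_window=50, ratio_threshold=5.0):
        self.baseline = deque(maxlen=baseline_window)  # 1 if query was flagged, else 0
        self.recent = deque(maxlen=recent_window)
        self.ratio_threshold = ratio_threshold

    def record(self, flagged: bool) -> bool:
        """Record one query; return True if the recent flag rate has spiked."""
        self.baseline.append(int(flagged))
        self.recent.append(int(flagged))
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent data yet
        baseline_rate = max(sum(self.baseline) / len(self.baseline), 1e-3)
        recent_rate = sum(self.recent) / len(self.recent)
        return recent_rate / baseline_rate > self.ratio_threshold

detector = SpikeDetector()
traffic = [False] * 2000 + [True] * 40  # a quiet period, then a burst of flags
for flagged in traffic:
    if detector.record(flagged):
        print("Spike detected: page the on-call team and consider pausing deployment.")
        break
```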

(Labs should make a plan to mitigate disruption to their users when responding to emergencies.)

Evaluation

Labs should have the ability to shut everything down quickly plus a plan for what could trigger that and how to quickly determine what went wrong. Monitoring-to-trigger-shutdown doesn’t need to be implemented, but it should be ready to be implemented if necessary.

Sources

Deployment corrections (IAPS: O’Brien et al. 2023).

“Preventing and monitoring model misuse” in “Emerging processes for frontier AI safety” (UK Department for Science, Innovation and Technology 2023) recommends preparing to rapidly roll back deployment in case of an emergency.

Towards best practices in AGI safety and governance (GovAI, Schuett et al. 2023):

  • “Emergency response plan. AGI labs should have and practice implementing an emergency response plan. This might include switching off systems, overriding their outputs, or restricting access.”

What labs are doing

Microsoft: nothing.

Google DeepMind: nothing.

Meta AI: since they don’t maintain control of their models, they can’t do anything about emergencies caused by their models.

OpenAI: their Preparedness Framework mentions that they want to “be prepared if fast-moving emergency scenarios arise” and plan to conduct “safety drills,” but doesn’t say anything about how they would actually respond to an emergency. See also “Post-deployment monitoring for patterns of misuse” in “OpenAI’s Approach to Frontier Risk” (OpenAI 2023).

Anthropic: their Responsible Scaling Policy includes this commitment for deploying ASL-3 models:

Rapid response to model vulnerabilities: When informed of a newly discovered model vulnerability enabling catastrophic harm (e.g. a jailbreak or a detection failure), we commit to mitigate or patch it promptly (e.g. 50% of the time in which catastrophic harm could realistically occur). As part of this, Anthropic will maintain a publicly available channel for privately reporting model vulnerabilities.

  1. I mostly use “release” and “deploy” interchangeably. Sometimes “release” refers to external deployment. 

  2. Deployment is not binary. See The Gradient of Generative AI Release (Solaiman 2023), Structured access (Shevlane 2022), and Beyond “Release” vs. “Not Release” (Sastry 2021). 

  3. Structured access involves constructing, through technical and often bureaucratic means, a controlled interaction between an AI system and its user. The interaction is structured to both (a) prevent the user from using the system in a harmful way, whether intentional or unintentional, and (b) prevent the user from circumventing those restrictions by modifying or reproducing the system. . . .

    Most thinking under the “publication norms” banner assumes that AI research outputs (e.g., models, code, descriptions of new methods) will be shared as information and asks whether certain pieces of information should be withheld. Structured access goes one step further. Structured access leverages the fact that AI models are simultaneously both (a) informational (i.e., exi[s]ting as files that can be inspected, copied, and shared) and (b) tools that can be applied for practical purposes. This means that there are two fundamental channels through which AI developers can share their work. They can disseminate the software as information (e.g., by uploading models to GitHub). But they can also lend access to the software’s capabilities, analogous to how a keycard grants access to certain rooms of a building. Structured access involves giving the user access to the tool, without giving them enough information to create a modified version. This gives the developer (or those regulating the developer) much greater control over how the software is used, going beyond what can be achieved through the selective disclosure of information.

    Therefore, this chapter attempts a reframing of the “publication norms” agenda, adding a greater emphasis on arm’s length, controlled interactions with AI systems. The central question should not be: what information should be shared, and what concealed? (although that is still an important sub-question). Rather, the central question should be: how should access to this technology be structured?

  4. See “Use controls” and “Modification and reproduction controls” in “Structured Access” (Shevlane 2022) and “Types of Intervention to Address Misuse” in “Protecting Society from AI Misuse” (GovAI: Anderljung and Hazell 2023). 

  5. See e.g. The Time Is Now to Develop Community Norms for the Release of Foundation Models (Center for Research on Foundation Models 2022). 

  6. See Frontier AI Regulation (Anderljung et al. 2023) and Model evaluation for extreme risks (DeepMind: Shevlane et al. 2023). 

  7. We are also unaware of substantive statements by Google DeepMind or its leadership on its attitudes on release. 

  8. AlphaFold reveals the structure of the protein universe (DeepMind 2022). We do not know whether open-sourcing AlphaFold 2 was dangerous, nor whether it indicates that DeepMind is likely to open-source dangerous models in the future. 

  9. Galactica. Note also that the authors say “Thanks to the open source creators whose libraries, datasets and other tools we utilized. Your efforts accelerated our efforts; and we open source our model to accelerate yours.” They also say “We believe models should be open source.”

  10. For example, on an investor call:

    I know some people have questions about how we benefit from open sourcing the results of our research and large amounts of compute, so I thought it might be useful to lay out the strategic benefits. The short version is that open sourcing improves our models, and because there’s still significant work to turn our models into products and because there will be other open source models available anyway, we find that there are mostly advantages to being the open source leader and it doesn’t remove differentiation from our products much anyway.

    More specifically, there are several strategic benefits. First, open source software is typically safer and more secure, as well as more compute efficient to operate due to all the ongoing feedback, scrutiny, and development from the community. This is a big deal because safety is one of the most important issues in AI. Efficiency improvements and lowering the compute costs also benefits everyone including us. Second, open source software often become[s] an industry standard, and when companies standardize on building with our stack, that then becomes easier to integrate new innovations into our products. That’s subtle, but the ability to learn and improve quickly is a huge advantage and being an industry standard enables that. Third, open source is hugely popular with developers and researchers. We know that people want to work on open systems that will be widely adopted, so this helps us recruit the best people at Meta, which is a very big deal for leading in any new technology area. And again, we typically have unique data and build unique product integrations anyway, so providing infrastructure like Llama as open source doesn’t reduce our main advantages. This is why our long-standing strategy has been to open source general infrastructure and why I expect it to continue to be the right approach for us going forward.

  11. Anthropic says “We generally don’t publish [capabilities research] because we do not wish to advance the rate of AI capabilities progress”; it is not clear how this attitude applies to model releases. 

  12. Boosting just careful actors on dangerous paths, perhaps by sharing research within a consortium of careful labs, could be good. 

  13. But since publishing capabilities research boosts others’ capabilities, not-publishing is generally good for personal-advantage / relative-position. Indeed, Google AI decided to stop publishing capabilities research to improve its relative capabilities. See Google shared AI knowledge with the world – until ChatGPT caught up (Washington Post 2023). (Of course, ideally labs don’t publish dangerous stuff even when it’s in their (or their staff’s) interest (typically because publishing gets attention/praise/status, for labs and researchers).) 

  14. See “Understanding of capabilities is valuable” in “Thoughts on sharing information about language model capabilities” (Christiano 2023). 

  15. This sentence is closely adapted from Thoughts on sharing information about language model capabilities (Christiano 2023). 

  16. If a path is moderately likely to win, then boosting progress on it substantially increases that probability but shortens time to powerful systems. If a path is unlikely to win, something similar holds to a lesser degree. If a path is very likely to win, then boosting progress shortens time without increasing the probability that that path wins much. Other considerations: fortunately, working on a safer path can make it more competitive and cause others to work on that path. Unfortunately, working on a safer path can still boost progress on the default path.

  17. A candidate is LM agents with “human-comprehensible decompositions with legible interfaces.” But this seems to be the default path to powerful AI; in the absence of a competitive more-dangerous path, speeding progress on this path seems net-negative. I buy that labs should share information about capabilities and “tasks and evaluation frameworks for LM agents, the results of evaluations of particular agents, [and] discussions of the qualitative strengths and weaknesses of agents.” I’m agnostic on sharing capabilities research for LM agents on the theory that LM agents are unusually safe plus sharing now reduces overhang—it depends on threat models and maybe the particular research.

  18. [Fact-check “viruses, nuclear physics, and rockets” (and link to sources).] 

  19. Update: I now have a section on committing to future good actions. But I think this can stay here, at least insofar as it’s evidence about current behavior. 

  20. Chris Olah (Anthropic cofounder). 

  21. See e.g. What the AI Community Can Learn From Sneezing Ferrets and a Mutant Virus Debate (PAI 2020). 

  22. Google AI has had a policy against publishing capabilities research, for advantage rather than safety (Washington Post). We do not know how strong this policy is; for example, the PaLM 2 Technical Report (Google 2023) contains capabilities research. Regardless, we do not believe that this applies to Google DeepMind. 

  23. See DeepMind’s Research, in particular e.g. An empirical analysis of compute-optimal large language model training (2022) (Chinchilla), Tackling multiple tasks with a single visual language model (2022) (Flamingo), and Language modelling at scale (2021) (Gopher). See also Google’s Research, in particular e.g. Attention is All you Need (2017) (Transformer) and BERT (2019). 

  24. See Statement of Support for Meta’s Open Approach to Today’s AI (Meta 2023), Democratizing access to large-scale language models with OPT-175B (Meta AI 2022), and a post (Zuckerberg 2023). And they say they published LLaMA research and model weights “As part of Meta’s commitment to open science.” 

  25. See their Research and Publications

  26. With OPT-175B, Meta AI followed PAI’s recommendations of transparency but not of noticing negative downstream consequences. Unfortunately, they thereby missed the safety benefits. See Recommendations in Practice: Meta Researchers Apply PAI Guidance (PAI) and Democratizing access to large-scale language models with OPT-175B (Meta AI). Meta AI says:

    Following the publication guidelines for researchers generated by the Partnership on AI, along with the governance guidance outlined by NIST in March 2022 (section 3.4), we are releasing all our notes documenting the development process, including the full logbook detailing the day-to-day training process, so other researchers can more easily build on our work. Furthermore, these details disclose how much compute was used to train OPT-175B and the human overhead required when underlying infrastructure or the training process itself becomes unstable at scale.

  27. See e.g. Language models are few-shot learners (2020) and Scaling laws for neural language models (2020). 

  28. See their Research index. We weakly believe that their more recent publications do not include substantial capabilities research. 

  29. The report says:

    GPT-4 is a Transformer-style model pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF). Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

  30. The Verge 2023:

    We were wrong. Flat out, we were wrong. If you believe, as we do, that at some point, AI — AGI — is going to be extremely, unbelievably potent, then it just does not make sense to open-source. It is a bad idea… I fully expect that in a few years it’s going to be completely obvious to everyone that open-sourcing AI is just not wise.

  31. See their Research. But there are some concerns that interpretability research is dual-use. See If interpretability research goes well, it may get dangerous (Soares 2023), Should we publish mechanistic interpretability research? (Hobbhahn and Chan 2023), and Some background for reasoning about dual-use alignment research (Steiner 2023). We do not know if interpretability research substantially boosts capabilities, nor, if so, how much it differentially boosts understandable models. 

  32. See Model Card and Evaluations for Claude Models (Anthropic 2023). On the Claude models, see Introducing Claude (Anthropic 2023), Claude 2 (Anthropic 2023), and Claude 3 (Anthropic 2024). 

  33. AI Control: Improving Safety Despite Intentional Subversion (Redwood: Greenblatt et al. 2023). 

  34. AI Control: Improving Safety Despite Intentional Subversion (Redwood: Greenblatt et al. 2023). 

  35. Untrusted smart models and trusted dumb models (Shlegeris 2023) and Coup probes (Roger 2023). 

  36. AI Control: Improving Safety Despite Intentional Subversion (Redwood: Greenblatt et al. 2023). 

  37. Untrusted smart models and trusted dumb models (Shlegeris 2023). See also a comment

  38. On dangerous model properties, see Table 1 in “Model evaluation for extreme risks” (DeepMind: Shevlane et al. 2023). 

  39. This includes indirect or intermediate inputs and outputs, including context, chain-of-thought, and scratchpad-notes. And “inputs” includes everything necessary to replicate the computation. 

  40. Privacy is an issue, in general and for some users or uses. 

  41. It might look something like this diagram (Shlegeris 2023) but more detailed, or this diagram (Markov et al. 2023) but focused on safety. 

  42. In the context of coup probes, Fabien Roger remarks:

    OpenAI has published a paper explaining what techniques they are using to detect harmful generations. The main idea is to combine open-source data and production data to train a very small Transformer to classify pieces of text. But using a less capable trusted model has several disadvantages:

    • The small model won’t understand subtleties of the text it reads, while probes might capture what the smart model understands about its own text (e.g. if the smart model might write in a language the weak model doesn’t understand - and a probe might correctly generalize to this hard-to-understand language).
    • The smart model might have an easier time gaming a smaller model than its own probes since fooling its probes without fooling itself might be very difficult.

  43. Recommendation: Bug Bounties and Responsible Disclosure for Advanced ML Systems (Gray 2023). 

  44. OpenAI’s old Coordinated vulnerability disclosure policy seems to be succeeded by their bug bounty program.