Commitments

This page collects AI companies’ commitments relevant to AI safety and extreme risks.

Commitments by several companies

White House voluntary commitments

The White House voluntary commitments were joined by Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI in July 2023 and Adobe, Cohere, IBM, Nvidia, Palantir, Salesforce, Scale AI, and Stability AI in September 2023.

The commitments “apply only to generative models that are overall more powerful than the current most advanced model produced by the company making the commitment,” but all relevant companies have created more powerful models since making the commitments. The commitments most relevant to safety are:

  • “[I]nternal and external red-teaming of models or systems in areas including misuse, societal risks, and national security concerns, such as bio, cyber, [autonomous replication,] and other safety areas.”
  • “[A]dvanc[e] ongoing research in AI safety, including on the interpretability of AI systems’ decision-making processes and on increasing the robustness of AI systems against misuse.”
  • “Work toward information sharing among companies and governments regarding trust and safety risks, dangerous or emergent capabilities, and attempts to circumvent safeguards”: “establish or join a forum or mechanism through which they can develop, advance, and adopt shared standards and best practices for frontier AI safety.”
  • “Invest in cybersecurity and insider threat safeguards to protect proprietary and unreleased model weights”: “limit[] access to model weights to those whose job function requires it and establish[] a robust insider threat detection program consistent with protections provided for their most valuable intellectual property and trade secrets. In addition . . . stor[e] and work[] with the weights in an appropriately secure environment to reduce the risk of unsanctioned release.”
  • “Incent third-party discovery and reporting of issues and vulnerabilities”: establish “bounty systems, contests, or prizes to incent the responsible disclosure of weaknesses, such as unsafe behaviors, or [] include AI systems in their existing bug bounty programs.”
  • “Publicly report model or system capabilities, limitations, and domains of appropriate and inappropriate use, including discussion of societal risks, such as effects on fairness and bias”: “publish reports for all new significant model public releases . . . . These reports should include the safety evaluations conducted (including in areas such as dangerous capabilities, to the extent that these are responsible to publicly disclose) . . . and the results of adversarial testing conducted to evaluate the model’s fitness for deployment” and “red-teaming and safety procedures”.

Some companies announced and commented on their commitments, including Microsoft (which simultaneously made additional commitments), Google, and OpenAI.

We track some companies’ compliance with these commitments below. Note that there was no specific timeline for implementing these commitments.

  Red-teaming
    • Microsoft: No. They seem to do red-teaming just for undesired content, not dangerous capabilities.
    • Google: Yes, though their bio evals are crude; they are improving them.
    • Meta: Partial. They do some red-teaming for bio, cyber, and more, but not autonomous replication, and they say little about bio.
    • OpenAI: Yes, or at least they’re working on this; they haven’t finished creating their evals. See also their Preparedness Framework (Beta).
    • Anthropic: Yes. See also their Responsible Scaling Policy.

  Safety research
    • Microsoft: A little.
    • Google: Yes, lots.
    • Meta: A little.
    • OpenAI: Yes, lots.
    • Anthropic: Yes, lots.

  Information sharing1
    • Microsoft: Unknown.
    • Google: Unknown.
    • Meta: Unknown.
    • OpenAI: Unknown.
    • Anthropic: Unknown.

  Securing model weights
    • Microsoft: Unknown.
    • Google: Unknown.
    • Meta: Unknown.
    • OpenAI: Unknown.
    • Anthropic: Unknown.

  AI bug bounty
    • Microsoft: Partial. Their AI bounty program includes “Revealing Bing’s internal workings and prompts, decision making processes and confidential information” and data poisoning that affects other users. It excludes attacks to get the model to help with misuse.
    • Google: Partial. Their bug bounty program includes some attacks on AI systems. It excludes jailbreaks, even if they lead to unsafe behaviors.
    • Meta: Partial. Their bug bounty program includes “reports that demonstrate integral privacy or security issues associated with Meta’s large language model, Llama 2, including being able to leak or extract training data through tactics like model inversion or extraction attacks.” It excludes other adversarial attacks and “risky content generated by the model.”
    • OpenAI: No. Their bug bounty program excludes issues with models.
    • Anthropic: No. They do not have a bug bounty program.

  Reporting
    • Microsoft: Yes, but they don’t do dangerous capability evaluations. See the Phi-3 report and model cards.
    • Google: Yes. See the Gemini report and Evaluating Frontier Models for Dangerous Capabilities.
    • Meta: Yes. See the Llama 3 model card.
    • OpenAI: Mostly. See the GPT-4 system card. But they have not released the dangerous capability evaluations for their most powerful model, GPT-4o.
    • Anthropic: Yes. See the Claude 3 model card.

Recall that this is only the subset of the White House voluntary commitments most relevant to safety.

AI Safety Summit

The UK AI Safety Summit occurred in November 2023.

AI Safety Policies

The UK Department for Science, Innovation and Technology published best practices for AI safety, along with responses from Amazon, Anthropic, DeepMind, Inflection, Meta, Microsoft, and OpenAI describing their practices.

Safety testing

POLITICO reported after the summit that “under a new agreement ‘like-minded governments’ would be able to test eight leading tech companies’ AI models before they are released.” We suspect this is mistaken and that no such agreement exists. Our guess is that this claim is a misinterpretation of the UK AI Safety Summit “safety testing” session statement. We are interested in better sources or clarification.

POLITICO reported that as of April 2024, DeepMind has shared pre-deployment model access with the UK AI Safety Institute and Meta, OpenAI, and Anthropic have not. If the companies did commit to share access, then at least Anthropic, Meta, and maybe Microsoft should have shared Claude 3, Llama 3, and maybe Phi-3, respectively. DeepMind seems to have shared only very shallow access with external evaluators.

POLITICO’s reporting suggests that Meta and Anthropic don’t consider themselves bound by the (ostensible) agreement.

Frontier Model Forum

The Frontier Model Forum is “a joint . . . effort between OpenAI, Anthropic, Google, and Microsoft.” We hoped that the Forum would help the companies coordinate on best practices for safety, but it seems ineffective at this goal; there are no substantive practices, standards, or commitments associated with it.


The sections below cover individual companies, sorted by how much we like their commitments.

Anthropic

  • Anthropic has a Responsible Scaling Policy. It says Anthropic:
    • Does risk assessment “after every 4x jump in effective compute” (and at least every 3 months, but maybe the 3-month testing uses older models; it’s unclear); see the sketch after this list
    • Uses some somewhat-specific security practices
    • Uses very basic deployment safety practices, such as Constitutional AI
    • Will, by the time it reaches the ASL-3 risk threshold, define the ASL-4 risk threshold and corresponding commitments
    • Will, before training ASL-3 models, implement great security practices
    • Will, before deploying ASL-3 models (including internal deployment), implement safety practices such as “automated detection”
    • Will “Implement a non-compliance reporting policy for our Responsible Scaling Commitments as part of reaching ASL-3.”
    • “Document[s] and test[s] internal safety procedures.”
    • “Proactively plan[s] for a pause in scaling.”
    • “Publicly share[s] evaluation results after model deployment where possible.”
    • “Share[s] results of ASL evaluations promptly with Anthropic’s governing bodies” and “share[s] a report on implementation status to our board and LTBT, explicitly noting any deficiencies in implementation.”
    • Will “pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.” (This is clarified several times, including once in boldface.)
  • Anthropic once gave some people the impression that it planned not to release models that push the frontier of publicly available models. In particular, CEO Dario Amodei gave Dustin Moskovitz the impression that Anthropic committed “to not meaningfully advance the frontier with a launch.” Anthropic has not clarified this, even after its Claude 3 launch caused confusion about whether the commitment ever existed and what its current status is.
  • Anthropic plans for their Long-Term Benefit Trust—an independent group with a mandate for safety and benefit-sharing—to elect a majority of their board by 2027. But Anthropic’s investors can abrogate the Trust and the details are unclear. (This is part of Anthropic’s internal governance, not really a commitment.)
  • See also Core Views on AI Safety (Anthropic 2023). (This is a discussion of Anthropic’s views and their implications, not a commitment.)
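
To make the evaluation triggers above concrete, here is a minimal sketch, in Python, of how we read the RSP’s trigger rule: a new round of evaluations is due after a 4x increase in effective compute since the last round, or after roughly 3 months, whichever comes first. The function and variable names (evaluation_due, EFFECTIVE_COMPUTE_MULTIPLIER, and so on) are our own illustration, not anything Anthropic has published.

    from datetime import datetime, timedelta

    # Thresholds taken from the RSP text; everything else here is our own sketch.
    EFFECTIVE_COMPUTE_MULTIPLIER = 4        # "after every 4x jump in effective compute"
    MAX_EVAL_INTERVAL = timedelta(days=90)  # "at least every 3 months" (approximated as 90 days)

    def evaluation_due(current_effective_compute: float,
                       compute_at_last_eval: float,
                       last_eval_time: datetime,
                       now: datetime) -> bool:
        """Return True if either (sketched) trigger for a new round of evals has fired."""
        compute_trigger = current_effective_compute >= EFFECTIVE_COMPUTE_MULTIPLIER * compute_at_last_eval
        time_trigger = now - last_eval_time >= MAX_EVAL_INTERVAL
        return compute_trigger or time_trigger

    # Example: a 4.5x jump in effective compute triggers evals even if the last round was recent.
    assert evaluation_due(4.5e25, 1e25, datetime(2024, 5, 1), datetime(2024, 5, 20))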

OpenAI

  • OpenAI is developing a Preparedness Framework. Its current beta version says OpenAI will:
    • Do model evals for dangerous capabilities “continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training. This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough.”
    • Publish and frequently update a “scorecard” tracking whether OpenAI’s most powerful model is “low,” “medium,” “high,” or “critical” risk in each risk category, both before and after mitigations. (In our opinion, the thresholds for “high” and “critical” risk are too high and a model just below the “critical” threshold could cause a catastrophe.)
    • External deployment: if the model scores “high” in any category, don’t deploy it externally until implementing mitigations such that it drops to “medium.” (A rough sketch of this gating logic appears after this list.)
      • No commitments about internal deployment, except as a corollary of commitments about training.
      • The PF briefly discusses mitigations, but that section doesn’t make sense.2
    • Training: before reaching “critical” pre-mitigation risk, OpenAI commits to:
      • Implement mitigations to make the post-mitigation risk at most “high.” On its face, this doesn’t make much sense: mitigations are about how the model is deployed, not something that is implemented during training or that reduces the risks from the model being stolen or escaping. Perhaps OpenAI means it will implement those mitigations for internal deployment.
      • Get “dependable evidence that the model is sufficiently aligned that it does not initiate ‘critical’-risk-level tasks unless explicitly instructed to do so.” This is nice but too vague; OpenAI should explain what such evidence could look like.
    • Security: “If we reach (or are forecasted to reach) at least ‘high’ pre-mitigation risk in any of the considered categories: we will ensure that our security is hardened in a way that is designed to prevent our mitigations and controls from being circumvented via exfiltration (by the time we hit ‘high’ pre-mitigation risk). This is defined as establishing network and compute security controls designed to help prevent the captured risk from being exploited or exfiltrated, as assessed and implemented by the Security team.” This is hopelessly vague.
    • Audits: model evals—or at least their methodologies—will sometimes be reviewed by a third party.
    • Conduct annual safety drills.
    • Forecast future risks.
    • Write an internal report monthly.
  • Stop-and-assist clause in OpenAI’s Charter:

    We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”

  • OpenAI committed to give their Superalignment safety team “20% of the compute we’ve secured to date” (in July 2023), to be used “over the next four years.” This is nice, but it’s unclear how much compute it is, and with compute use increasing exponentially it may be quite little in 2027.
  • OpenAI is fully controlled by a nonprofit board with a mandate for safety and benefit-sharing. However, the previous board was ineffective in its attempt to remove CEO Sam Altman. (This is part of OpenAI’s internal governance, not really a commitment.)
  • OpenAI said “We think it’s important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year.” That was in February 2023; we do not believe they elaborated or have a policy on this topic. (They did mention that they shared early access to GPT-4 with METR, then part of the Alignment Research Center.)
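
To illustrate the Preparedness Framework’s gating logic as we read it (see the external-deployment and training bullets above), here is a minimal sketch: external deployment requires post-mitigation risk of at most “medium” in every tracked category, and continued development requires post-mitigation risk of at most “high.” The category names, data structures, and function names below are our own illustration, not OpenAI’s.

    # Risk levels from the PF scorecard; the categories and helper names are our own illustration.
    RISK_LEVELS = ["low", "medium", "high", "critical"]

    def highest_risk(scorecard: dict) -> int:
        """Index of the highest risk level across categories (scorecard maps category -> level)."""
        return max(RISK_LEVELS.index(level) for level in scorecard.values())

    def may_deploy_externally(post_mitigation: dict) -> bool:
        # Deploy externally only if post-mitigation risk is at most "medium" in every category.
        return highest_risk(post_mitigation) <= RISK_LEVELS.index("medium")

    def may_continue_development(post_mitigation: dict) -> bool:
        # Keep training only while post-mitigation risk is at most "high" in every category.
        return highest_risk(post_mitigation) <= RISK_LEVELS.index("high")

    # Example: a model rated "high" pre-mitigation in one category that drops to "medium"
    # after mitigations could be deployed externally under this reading.
    post_mitigation_scores = {"cyber": "medium", "bio": "low", "persuasion": "low", "autonomy": "low"}
    assert may_deploy_externally(post_mitigation_scores)
    assert may_continue_development(post_mitigation_scores)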

Microsoft

  • Microsoft joined the White House voluntary commitments and simultaneously made additional commitments. But none of them are important enough to mention here.
  • Microsoft’s AI Safety Policies (2023) says “When it comes to frontier model deployment, Microsoft and OpenAI have together defined capability thresholds that act as a trigger to review models in advance of their first release or downstream deployment. The scope of a review, through our joint Microsoft-OpenAI Deployment Safety Board (DSB), includes model capability discovery.” This sounds good, but Microsoft has not elaborated on these capability thresholds, shared details about the DSB, or shared details about past reviews.

See also Microsoft’s Responsible AI page, Responsible AI Transparency Report (2024), and Responsible AI Standard, v2 (2022).

Google & Google DeepMind

  • DeepMind CEO Demis Hassabis said DeepMind will talk about risk assessment and safety policies in 2024.
  • Google’s AI Principles says “we will not design or deploy AI” in areas including “Technologies that cause or are likely to cause overall harm. Where there is a material risk of harm, we will proceed only where we believe that the benefits substantially outweigh the risks, and will incorporate appropriate safety constraints.” But it’s not clear that Google is aware of misalignment and extreme risks from advanced AI or has policies sufficient to detect or prevent hard-to-notice risks.
  • Google’s AI Principles says “we will not design or deploy AI” as “Weapons,” and DeepMind signed the Lethal Autonomous Weapons Pledge.

See also AI Safety Summit: An update on our approach to safety and responsibility (DeepMind 2023), AI Principles Progress Update 2023 (Google), Responsible AI Practices (Google) and Responsibility & Safety (DeepMind).

Meta

Meta has made no commitments relevant to AI safety and extreme risks beyond the White House voluntary commitments.

See Meta’s Overview of Meta AI safety policies prepared for the UK AI Safety Summit (2023) and Responsible AI page.


Other companies

We are not aware of notable commitments from other companies.

Open letters

Some companies’ leaders have signed open letters including Research Priorities for Robust and Beneficial Artificial Intelligence (FLI 2015) and the Asilomar AI Principles (FLI 2017).

  1. Based on public information, we do not currently consider the Frontier Model Forum (mentioned below) to satisfy the “information sharing” commitment. We don’t believe the companies have said anything substantive on information sharing. 

  2. The Mitigations section just says:

    A central part of meeting our safety baselines is implementing mitigations to address various types of model risk. Our mitigation strategy will involve both containment measures, which help reduce risks related to possession of a frontier model, as well as deployment mitigations, which help reduce risks from active use of a frontier model. As a result, these mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners

    This is confusing. Risk assessment is done on pre- and post-mitigation versions of a model. So mitigations must be things like RLHF and inference-time filters—things that affect model evals and red-teaming—not like restricting deployment to certain users or alerting distribution partners.