This page is out of date; in particular, it describes the old version of Anthropic’s Responsible Scaling Policy (RSP).
This page collects AI companies’ commitments relevant to AI safety and extreme risks.
Commitments by several companies
White House voluntary commitments
The White House voluntary commitments were joined by Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI in July 2023; Adobe, Cohere, IBM, Nvidia, Palantir, Salesforce, Scale AI, and Stability AI in September 2023; and Apple in July 2024.
The commitments “apply only to generative models that are overall more powerful than the current most advanced model produced by the company making the commitment,” but all relevant companies have created more powerful models since making the commitments. The commitments most relevant to safety are:
- ”[I]nternal and external red-teaming of models or systems in areas including misuse, societal risks, and national security concerns, such as bio, cyber, [autonomous replication,] and other safety areas.”
- ”[A]dvanc[e] ongoing research in AI safety, including on the interpretability of AI systems’ decision-making processes and on increasing the robustness of AI systems against misuse.”
- “Work toward information sharing among companies and governments regarding trust and safety risks, dangerous or emergent capabilities, and attempts to circumvent safeguards”: “establish or join a forum or mechanism through which they can develop, advance, and adopt shared standards and best practices for frontier AI safety.”
- “Invest in cybersecurity and insider threat safeguards to protect proprietary and unreleased model weights”: “limit[] access to model weights to those whose job function requires it and establish[] a robust insider threat detection program consistent with protections provided for their most valuable intellectual property and trade secrets. In addition . . . stor[e] and work[] with the weights in an appropriately secure environment to reduce the risk of unsanctioned release.”
- “Incent third-party discovery and reporting of issues and vulnerabilities”: establish “bounty systems, contests, or prizes to incent the responsible disclosure of weaknesses, such as unsafe behaviors, or [] include AI systems in their existing bug bounty programs.”
- “Publicly report model or system capabilities, limitations, and domains of appropriate and inappropriate use, including discussion of societal risks, such as effects on fairness and bias”: “publish reports for all new significant model public releases . . . . These reports should include the safety evaluations conducted (including in areas such as dangerous capabilities, to the extent that these are responsible to publicly disclose) . . . and the results of adversarial testing conducted to evaluate the model’s fitness for deployment” and “red-teaming and safety procedures.”
Some companies announced and commented on their commitments, including Microsoft (which simultaneously made additional commitments), Google, and OpenAI.
We track some companies’ compliance with these commitments in the table below. Note that there was no specific timeline for implementing these commitments.
| | Microsoft | Google | Meta | OpenAI | Anthropic |
|---|---|---|---|---|---|
| Red-teaming | No. They seem to do red-teaming just for undesired content, not dangerous capabilities. | Yes, but their bio evals are crude; they are improving them. | Partial. They do some red-teaming for bio, cyber, and more, but not autonomous replication, and they say little about bio. | Mostly. They’re working on this, but they haven’t finished creating their evals. See also their Preparedness Framework (Beta). | Yes. See also their Responsible Scaling Policy. |
| Safety research | A little. | Yes, lots. | A little. | Yes, lots. | Yes, lots. |
| Information sharing[1] | Unknown. | Unknown. | Unknown. | Unknown. | Unknown. |
| Securing model weights | Unknown. | Unknown. | Unknown. | Unknown. | Unknown. |
| AI bug bounty | Partial. Their AI bounty program includes “Revealing Bing’s internal workings and prompts, decision making processes and confidential information” and data poisoning that affects other users. It excludes attacks to get the model to help with misuse.[2] | Partial. Their bug bounty program includes some attacks on AI systems. It excludes jailbreaks, even if they lead to unsafe behaviors. | Partial. Their bug bounty program includes “reports that demonstrate integral privacy or security issues associated with Meta’s large language model, Llama 2, including being able to leak or extract training data through tactics like model inversion or extraction attacks.” It excludes other adversarial attacks and “risky content generated by the model.” | No. Their bug bounty program excludes issues with models. | Yes. |
| Reporting | Yes, but they don’t do dangerous capability evaluations. See the Phi-3 report and model cards. | Yes. See the Gemini report and Evaluating Frontier Models for Dangerous Capabilities. | Yes. See the Llama 3 model card. | Yes. See the o1 system card. | Yes. See the Claude 3 model card. |
Recall that this table covers only the subset of the White House voluntary commitments most relevant to safety.
AI Seoul Summit
16 AI companies joined the Frontier AI Safety Commitments in May 2024, basically committing to make responsible scaling policies by February 2025. See New voluntary commitments (AI Seoul Summit) (Stein-Perlman 2024).
AI Safety Summit
The UK AI Safety Summit occurred in November 2023.
AI Safety Policies
The UK Department for Science, Innovation and Technology published best practices for AI safety, along with responses describing the practices of Amazon, Anthropic, DeepMind, Inflection, Meta, Microsoft, and OpenAI.
Safety testing
POLITICO reported after the summit that “under a new agreement ‘like-minded governments’ would be able to test eight leading tech companies’ AI models before they are released.” We suspect this is mistaken: no such agreement exists and this claim is a misinterpretation of the UK AI Safety Summit “safety testing” session statement.
Regardless, Google DeepMind and Anthropic seem to have shared their most recent models with the UK AI Safety Institute for pre-deployment testing, while other labs do not appear to have done so.
Frontier Model Forum
The Frontier Model Forum was “a joint . . . effort between OpenAI, Anthropic, Google, and Microsoft,” later joined by Amazon and Meta. We hoped that the Forum would help the companies coordinate on best practices for safety, but it seems ineffective at this goal; there are no substantive practices, standards, or commitments associated with it.
Safety testing
Gina Raimondo: “the [US] AI Safety Institute will soon test all new advanced AI models before deployment, and . . . the leading companies have agreed to share their models” (TIME 2024).
On the UK AI Safety Institute, see above.
Anthropic
- Anthropic has a Responsible Scaling Policy. It says Anthropic:
- Does risk assessment “after every 4x jump in effective compute” and at least every 3 months (though it is unclear whether the 3-month testing uses older models); see the illustrative sketch after this list.
- Uses some somewhat-specific security practices.
- Uses very basic deployment safety practices, such as Constitutional AI.
- Will, by the time they reach the ASL-3 risk threshold, define the ASL-4 risk threshold and corresponding commitments.
- Will, before training ASL-3 models, implement strong security practices.
- Will, before deploying ASL-3 models (including internal deployment), implement safety practices such as “automated detection.”
- Will “Implement a non-compliance reporting policy for our Responsible Scaling Commitments as part of reaching ASL-3.”
- “Document[s] and test[s] internal safety procedures.”
- “Proactively plan[s] for a pause in scaling.”
- “Publicly share[s] evaluation results after model deployment where possible.”
- “Share[s] results of ASL evaluations promptly with Anthropic’s governing bodies” and “share[s] a report on implementation status to our board and LTBT, explicitly noting any deficiencies in implementation.”
- Will “pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.” (This is clarified several times, including once in boldface.)
- Anthropic once gave some people the impression that they planned to not release models that push the frontier of publicly available models. In particular, CEO Dario Amodei gave Dustin Moskovitz the impression that Anthropic committed “to not meaningfully advance the frontier with a launch.” Anthropic has not clarified this commitment, even after their Claude 3 launch caused confusion about whether the commitment ever existed and what Anthropic’s current position is.
- Anthropic says “We generally don’t publish [capabilities research] because we do not wish to advance the rate of AI capabilities progress.” An Anthropic cofounder says “we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk. In the case of difficult scenarios, we have a formal infohazard review procedure.”
- Anthropic’s Long-Term Benefit Trust—an independent group with a mandate for safety and benefit-sharing—has the power to elect a majority of Anthropic’s board (as of November 2024). But Anthropic’s shareholders can abrogate the Trust, and the details are unclear. (This is part of Anthropic’s internal governance, not really a commitment.)
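As a rough illustration of the evaluation cadence in the risk-assessment bullet above, here is a minimal sketch in Python. The 4x effective-compute jump and the 3-month interval are taken from the RSP; everything else (the function, its inputs, and treating effective compute as a single number) is our own simplifying assumption, not Anthropic’s actual process.

```python
from datetime import datetime, timedelta

# Illustrative only: the 4x effective-compute and 3-month triggers are quoted from the RSP;
# the rest of this sketch is an assumption, not Anthropic's actual tooling.
COMPUTE_JUMP = 4.0                       # "after every 4x jump in effective compute"
MAX_EVAL_INTERVAL = timedelta(days=90)   # "at least every 3 months"

def evaluation_due(effective_compute_now: float,
                   effective_compute_at_last_eval: float,
                   last_eval_date: datetime,
                   today: datetime) -> bool:
    """Return True if either stated trigger for a new round of risk assessment has fired."""
    compute_trigger = effective_compute_now >= COMPUTE_JUMP * effective_compute_at_last_eval
    time_trigger = today - last_eval_date >= MAX_EVAL_INTERVAL
    return compute_trigger or time_trigger

# Example: a 3.5x compute increase two months after the last evaluation fires neither trigger.
print(evaluation_due(3.5, 1.0, datetime(2024, 1, 1), datetime(2024, 3, 1)))  # False
```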
See also Core Views on AI Safety (Anthropic 2023) (a discussion of Anthropic’s views and their implications). See also Our response to the UK Government’s internal AI safety policy enquiries (Anthropic 2023) (a summary of some policies). Anthropic shared Claude 3.5 Sonnet with the UK AI Safety Institute and METR before deployment.
See also Tracking Voluntary Commitments.
Google & Google DeepMind
- DeepMind has a Frontier Safety Framework. It includes:
- “We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress.”
- “When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan.”
- Google’s AI Principles says “we will not design or deploy AI” in areas including “Technologies that cause or are likely to cause overall harm. Where there is a material risk of harm, we will proceed only where we believe that the benefits substantially outweigh the risks, and will incorporate appropriate safety constraints.” But it’s not clear that Google is aware of misalignment and extreme risks from advanced AI or has policies sufficient to detect or prevent hard-to-notice risks.
- Google’s AI Principles says “we will not design or deploy AI” as “Weapons,” and DeepMind signed the Lethal Autonomous Weapons Pledge.
See also AI Safety Summit: An update on our approach to safety and responsibility (DeepMind 2023), AI Principles Progress Update 2023 (Google), Responsible AI Practices (Google) and Responsibility & Safety (DeepMind). See also DeepMind’s “Frontier Safety Framework” is weak and unambitious (Stein-Perlman 2024).
OpenAI
- OpenAI is developing a Preparedness Framework. OpenAI included some PF-relevant details in the system cards for their most recent models, o1 and GPT-4o, although the GPT-4o system card was not published until about three months after the model’s release. The beta PF says OpenAI will:
- Do model evals for dangerous capabilities “continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training. This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough.”
- Publish and frequently update a “scorecard” tracking whether OpenAI’s most powerful model is “low,” “medium,” “high,” or “critical” risk in each risk category, both before and after mitigations. (In our opinion, the thresholds for “high” and “critical” risk are too high and a model just below the “critical” threshold could cause a catastrophe.)
- External deployment: if the model scores “high” in any category, don’t deploy it externally until implementing mitigations such that it drops to “medium.”
- No commitments about internal deployment, except as a corollary of commitments about training.
- The PF briefly discusses mitigations, but that section doesn’t make sense.[3]
- Training: before reaching “critical” pre-mitigation risk, OpenAI commits to:
- Implement mitigations to make the post-mitigation risk at most “high.” On its face, this doesn’t make much sense: mitigations are about how the model is deployed, not something that is implemented during training or that reduces the risks from the model being stolen or escaping. Perhaps OpenAI means it will implement those mitigations for internal deployment.
- Get “dependable evidence that the model is sufficiently aligned that it does not initiate ‘critical’-risk-level tasks unless explicitly instructed to do so.” This is nice but too vague; OpenAI should explain what such evidence could look like.
- Security: “If we reach (or are forecasted to reach) at least ‘high’ pre-mitigation risk in any of the considered categories: we will ensure that our security is hardened in a way that is designed to prevent our mitigations and controls from being circumvented via exfiltration (by the time we hit ‘high’ pre-mitigation risk). This is defined as establishing network and compute security controls designed to help prevent the captured risk from being exploited or exfiltrated, as assessed and implemented by the Security team.” This is hopelessly vague.
- Audits: model evals—or at least their methodologies—will sometimes be reviewed by a third party.
- Conduct annual safety drills.
- Forecast future risks.
- Write an internal report monthly.
- Stop-and-assist clause in OpenAI’s Charter:
We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”
- OpenAI committed to give their superintelligence alignment effort “20% of the compute we’ve secured to date” (in July 2023), to be used “over the next four years.” It’s unclear how much compute this was, and with compute use increasing exponentially it may be quite little in 2027 (see the illustrative sketch after this list). (Many observers misinterpreted the commitment as 20% of compute going forward, and very reasonably interpreted it as for the Superalignment team specifically, and OpenAI failed to correct these interpretations.) Regardless, the Superalignment team dissolved in May 2024. The team’s requests for compute were repeatedly denied, and there was no plan for allocating the promised compute. OpenAI later claimed that the compute was never meant for the Superalignment team in particular.
- In March 2024, OpenAI announced “important improvements to OpenAI’s governance structure”; whether they have been implemented or are to be implemented in the future is unclear:
- Adopting a new set of corporate governance guidelines;
- Strengthening OpenAI’s Conflict of Interest Policy;
- Creating a whistleblower hotline to serve as an anonymous reporting resource for all OpenAI employees and contractors; and
- Creating additional Board committees, including a Mission & Strategy committee focused on implementation and advancement of the core mission of OpenAI.
- In May 2024, OpenAI formed the Safety and Security Committee and committed to “publicly share an update on adopted recommendations.”
- OpenAI is fully controlled by a nonprofit board with a mandate for safety and benefit-sharing. However, the previous board was ineffective in its attempt to remove CEO Sam Altman. (This is part of OpenAI’s internal governance, not really a commitment.)
- OpenAI said “We think it’s important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year.” That was in February 2023; we do not believe they elaborated or have a policy on this topic. (They did mention that they shared early access to GPT-4 with METR, then part of the Alignment Research Center.)
- OpenAI said “We are committed to openly sharing our alignment research when it’s safe to do so: We want to be transparent about how well our alignment techniques actually work in practice and we want every AGI developer to use the world’s best alignment techniques.”
- OpenAI’s original[4] “mission is to ensure that artificial general intelligence benefits all of humanity.” Sometimes OpenAI and its leadership and staff state that the mission is to build AGI themselves, including in formal announcements and legal documents.[5]
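The following sketch makes concrete why “20% of the compute we’ve secured to date” (as of July 2023) could be quite little by 2027, as noted in the compute-commitment bullet above. The 20% share and the mid-2023 baseline come from the commitment; the 3x-per-year growth in secured compute is purely an assumed figure for illustration, since OpenAI’s actual compute growth is not public.

```python
# Illustrative arithmetic only: the 20% share and mid-2023 baseline are from the commitment;
# the 3x-per-year growth in total secured compute is an assumed figure.
ANNUAL_GROWTH = 3.0          # assumption: total secured compute grows ~3x per year
COMMITTED_SHARE_2023 = 0.20  # "20% of the compute we've secured to date" (July 2023)

for years_later in range(5):  # mid-2023 through mid-2027
    total = ANNUAL_GROWTH ** years_later  # total secured compute, in mid-2023 units
    committed_fraction = COMMITTED_SHARE_2023 / total
    print(f"mid-{2023 + years_later}: committed pool ≈ {committed_fraction:.1%} of total secured compute")
```

Under that assumed growth rate, the committed compute would amount to well under 1% of OpenAI’s total secured compute by mid-2027.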
See also OpenAI’s Approach to Frontier Risk (2023), OpenAI safety practices (2024), and Best Practices for Deploying Language Models (Cohere, OpenAI, and AI21 Labs 2022).
Microsoft
- Microsoft joined the White House voluntary commitments and simultaneously made additional commitments. But none of them are important enough to mention here.
- Microsoft’s AI Safety Policies (2023) says “When it comes to frontier model deployment, Microsoft and OpenAI have together defined capability thresholds that act as a trigger to review models in advance of their first release or downstream deployment. The scope of a review, through our joint Microsoft-OpenAI Deployment Safety Board (DSB), includes model capability discovery. . . . We have exercised this review process with respect to several frontier models, including GPT-4.” This sounds good, but Microsoft has not elaborated on these capability thresholds, shared details about the DSB, or shared details about past reviews. Microsoft deployed GPT-4 as a test without telling the DSB, and they initially denied this.
See also Microsoft’s Responsible AI page, Responsible AI Transparency Report (2024), and Responsible AI Standard, v2 (2022).
Meta
Meta has made no commitments relevant to AI safety and extreme risks beyond the White House voluntary commitments.
See Meta’s Overview of Meta AI safety policies prepared for the UK AI Safety Summit (2023) and Responsible AI page.
Other companies
Cohere joined Canada’s Voluntary Code of Conduct on the Responsible Development and Management of Advanced Generative AI Systems.
Open letters
Some companies’ leaders have signed open letters including Research Priorities for Robust and Beneficial Artificial Intelligence (FLI 2015) and the Asilomar AI Principles (FLI 2017).
[1] Based on public information, we do not currently consider the Frontier Model Forum (mentioned below) to satisfy the “information sharing” commitment. We don’t believe the companies have said anything substantive on information sharing.
[2] To be precise: Microsoft’s program previously said that jailbreaks that only affect the user were “Not in Scope.” This matters because that category includes jailbreaking the model to elicit dangerous capabilities. Now such jailbreaks are a “Content-related issue” and it’s not clear whether they’re in scope: Microsoft links to their Responsible AI Standard, but it doesn’t address this.
[3] The Mitigations section just says: “A central part of meeting our safety baselines is implementing mitigations to address various types of model risk. Our mitigation strategy will involve both containment measures, which help reduce risks related to possession of a frontier model, as well as deployment mitigations, which help reduce risks from active use of a frontier model. As a result, these mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners.” This is confusing: risk assessment is done on pre- and post-mitigation versions of a model, so mitigations must be things like RLHF and inference-time filters (things that affect model evals and red-teaming), not things like restricting deployment to certain users or alerting distribution partners.
[5] Also stated offhandedly, e.g. by Anna Makanju.