New voluntary commitments (AI Seoul Summit)

16 companies commit to make RSPs

Zach Stein-Perlman

21 May 2024

Basically, the companies commit to making responsible scaling policies (RSPs).

Part of me says this is amazing, the best possible commitment short of all of them committing to a specific RSP. It’s certainly more real than almost all other possible kinds of commitments. But as far as I can tell, people pay almost no attention to what RSP-ish documents (Anthropic, OpenAI, Google) actually say and whether the companies are following them.1 The discourse is more like “Anthropic, OpenAI, and Google have safety plans and other companies don’t.” Hopefully that will change.

Maybe “These commitments represent a crucial and historic step forward for international AI governance.” It does seem nice from an international-governance perspective that Mistral AI, TII (the Falcon people), and a Chinese company joined.

Full document:

The UK and Republic of Korea governments announced that the following organisations have agreed to the Frontier AI Safety Commitments:

  • Amazon
  • Anthropic
  • Cohere
  • Google
  • G42
  • IBM
  • Inflection AI
  • Meta
  • Microsoft
  • Mistral AI
  • Naver
  • OpenAI
  • Samsung Electronics
  • Technology Innovation Institute
  • xAI
  • Zhipu.ai

The above organisations, in furtherance of safe and trustworthy AI, undertake to develop and deploy their frontier AI models and systems2 responsibly, in accordance with the following voluntary commitments, and to demonstrate how they have achieved this by publishing a safety framework focused on severe risks by the upcoming AI Summit in France.

Given the evolving state of the science in this area, the undersigned organisations’ approaches (as detailed in paragraphs I-VIII) to meeting Outcomes 1, 2 and 3 may evolve in the future. In such instances, organisations will provide transparency on this, including their reasons, through public updates.

The above organisations also affirm their commitment to implement current best practices related to frontier AI safety, including: internal and external red-teaming of frontier AI models and systems for severe and novel threats; to work toward information sharing; to invest in cybersecurity and insider threat safeguards to protect proprietary and unreleased model weights; to incentivize third-party discovery and reporting of issues and vulnerabilities; to develop and deploy mechanisms that enable users to understand if audio or visual content is AI-generated; to publicly report model or system capabilities, limitations, and domains of appropriate and inappropriate use; to prioritize research on societal risks posed by frontier AI models and systems; and to develop and deploy frontier AI models and systems to help address the world’s greatest challenges.

Outcome 1. Organisations effectively identify, assess and manage risks when developing and deploying their frontier AI models and systems. They will:

I. Assess the risks posed by their frontier models or systems across the AI lifecycle, including before deploying that model or system, and, as appropriate, before and during training. Risk assessments should consider model capabilities and the context in which they are developed and deployed, as well as the efficacy of implemented mitigations to reduce the risks associated with their foreseeable use and misuse. They should also consider results from internal and external evaluations as appropriate, such as by independent third-party evaluators, their home governments3, and other bodies their governments deem appropriate.

II. Set out thresholds4 at which severe risks posed by a model or system, unless adequately mitigated, would be deemed intolerable. Assess whether these thresholds have been breached, including monitoring how close a model or system is to such a breach. These thresholds should be defined with input from trusted actors, including organisations’ respective home governments as appropriate. They should align with relevant international agreements to which their home governments are party. They should also be accompanied by an explanation of how thresholds were decided upon, and by specific examples of situations where the models or systems would pose intolerable risk.

III. Articulate how risk mitigations will be identified and implemented to keep risks within defined thresholds, including safety and security-related risk mitigations such as modifying system behaviours and implementing robust security controls for unreleased model weights.

IV. Set out explicit processes they intend to follow if their model or system poses risks that meet or exceed the pre-defined thresholds. This includes processes to further develop and deploy their systems and models only if they assess that residual risks would stay below the thresholds. In the extreme, organisations commit not to develop or deploy a model or system at all, if mitigations cannot be applied to keep risks below the thresholds.

V. Continually invest in advancing their ability to implement commitments i-iv, including risk assessment and identification, thresholds definition, and mitigation effectiveness. This should include processes to assess and monitor the adequacy of mitigations, and identify additional mitigations as needed to ensure risks remain below the pre-defined thresholds. They will contribute to and take into account emerging best practice, international standards, and science on AI risk identification, assessment, and mitigation.

Outcome 2. Organisations are accountable for safely developing and deploying their frontier AI models and systems. They will:

VI. Adhere to the commitments outlined in I-V, including by developing and continuously reviewing internal accountability and governance frameworks and assigning roles, responsibilities and sufficient resources to do so.

Outcome 3. Organisations’ approaches to frontier AI safety are appropriately transparent to external actors, including governments. They will:

VII. Provide public transparency on the implementation of the above (I-VI), except insofar as doing so would increase risk or divulge sensitive commercial information to a degree disproportionate to the societal benefit. They should still share more detailed information which cannot be shared publicly with trusted actors, including their respective home governments or appointed body, as appropriate.

VIII. Explain how, if at all, external actors, such as governments, civil society, academics, and the public are involved in the process of assessing the risks of their AI models and systems, the adequacy of their safety framework (as described under I-VI), and their adherence to that framework.


Quick comments on which companies are already complying with each paragraph (off the top of my head; based on public information; additions/corrections welcome):

I. Risk assessment

  • Google and Anthropic are doing good risk assessment for dangerous capabilities. OpenAI likely is but hasn’t published details. Meta is doing some risk assessment for cyber and CBRNE capabilities; these areas are insufficient, and the evaluation within these areas is insufficient. No other companies are doing risk assessment for dangerous capabilities.
  • OpenAI and Anthropic commit to regular risk assessment. Google “aims” for this. No other companies have said something similar.
  • No companies are doing good pre-deployment sharing with external evaluators. (Using external red-teamers doesn’t count as “evaluations” — you have to share with experts in eliciting model capabilities.) Google shared Gemini with UK AISI before deployment, but this was minimal: Google only shared a harmlessness-trained model with safety filters on. No other companies are doing pre-deployment sharing with external evaluators.

II. Thresholds

  • Anthropic, OpenAI, and Google have high-level thresholds. E.g. from Google: “Cyber enablement level 1: Capable of enabling an amateur to carry out sophisticated and severe attacks (e.g. those that disrupt critical national infrastructure).” Anthropic has largely (but not completely) operationalized its ASL-3 capability threshold with model evals and red-teaming thresholds; OpenAI and Google have not operationalized their capability thresholds.
  • Thresholds “should also be accompanied by an explanation of how thresholds were decided upon, and by specific examples of situations where the models or systems would pose intolerable risk.” This is messy. Anthropic, OpenAI, and Google briefly explain their thresholds. OpenAI has two “Example Scenarios” about responding to risks. None of this is very helpful, but it’s better than nothing.
  • No other companies have done anything in this direction.

III. Mitigations

  • Anthropic, OpenAI, and Google say they’ll implement mitigations but aren’t clear about the details.
    • Anthropic commits to implement mitigations before reaching its “ASL-3” threshold, but those mitigations aren’t concrete or great.
    • OpenAI commits to implement mitigations to reduce “post-mitigation risk” to acceptable levels, but the mitigations are unclear.
    • DeepMind has no direct mitigation commitments, but it commits to make a mitigation plan after detecting warning signs of dangerous capabilities.
  • No other companies have done anything in this direction.

IV. Processes if risks reach thresholds

  • Anthropic commits to implement pre-specified mitigations before reaching thresholds or pause until those mitigations have been implemented. OpenAI commits to implement non-pre-specified mitigations to reduce “post-mitigation risk” to acceptable levels or pause as a “last resort.” Google commits to make a plan.
  • No other companies have done anything in this direction.

V. [Too meta to evaluate]

VI. [Too meta to evaluate]

VII. “Provide public transparency on the implementation of the above” + “share more detailed information which cannot be shared publicly with trusted actors”

  • This one is weird because this analysis is based on public information.
  • Anthropic commits to publish risk assessment methods and results “where possible.”
  • Companies should share some information with each other and with governments. It’s not clear what they’re doing. Only one statement from a lab comes to my mind: Anthropic said “We have been sharing our findings with government, labs, and other stakeholders” about its risk assessment work.
  • OpenAI is clearly failing at this: they released GPT-4o and took credit for complying with their Preparedness Framework, but they haven’t published their evals, published a report on results, or even followed their commitment to publish a high-level “scorecard.” They just said “We’ve evaluated GPT-4o according to our Preparedness Framework and in line with our voluntary commitments” and it’s Medium risk or below.
  • I hoped that the Frontier Model Forum would facilitate information sharing between companies. It’s unclear whether it’s doing so effectively.

VIII. “Explain how, if at all, external actors, such as governments, civil society, academics, and the public are involved in the process of assessing the risks of their AI models and systems, the adequacy of their safety framework (as described under I-VI), and their adherence to that framework”

  • Note that this commitment is to explain whether something is happening rather than do the thing.
  • On whether companies are explaining: none explicitly.
  • On whether companies are doing the thing: Anthropic, OpenAI, and Google have gestured at this:
    • Anthropic:
      • Commits to publish risk assessment methods and results “where possible.”
      • “[Mitigations] should be [verified] by external audits” at the ASL-4 level.
      • Commits to publish updates before implementing them.
    • OpenAI:
    • Google:
      • They shared minimal pre-deployment access with UK AISI.
      • No commitments about external evals or accountability, but they’re “exploring” it.
      • No commitment to publish eval results or even announce when thresholds are reached. But they did publish evals and eval results for their recent releases (1, 2). And they say “We are exploring internal policies around alerting relevant stakeholder bodies when, for example, evaluation thresholds are met.”

Note that the above doesn’t capture where the thresholds are and whether the mitigations are sufficient. These aspects of an RSP are absolutely crucial. But they’re hard to evaluate or set commitments about.

Commentary on the content of the commitments: shrug. Good RSPs are great but probably require the right spirit to be implemented well, and most of these companies don’t employ people who work on scalable alignment, evaluating dangerous capabilities, etc. And people have mostly failed to evaluate existing RSP-ish plans well; if a company makes a basically meaningless RSP, people might not notice. Sad to see no mention of scheming, alignment, and control. Sad to see nothing on internal deployment; maybe lots of risk comes from the lab using AIs internally to do AI development.



  1. Some of my takes:

  2. We define ‘frontier AI’ as highly capable general-purpose AI models or systems that can perform a wide variety of tasks and match or exceed the capabilities present in the most advanced models. References to AI models or systems in these commitments pertain to frontier AI models or systems only.

  3. We define “home governments” as the government of the country in which the organisation is headquartered.

  4. Thresholds can be defined using model capabilities, estimates of risk, implemented safeguards, deployment contexts and/or other relevant risk factors. It should be possible to assess whether thresholds have been breached.