OpenAI

Overall score
26%

OpenAI Deployment

26% Score 24% Weight

Releasing models well

  • Deploy based on risk assessment results: No, they share their models with Microsoft, which has made no such commitments. Otherwise 33%: their Preparedness Framework says they won't externally deploy a model with a post-mitigation risk score of "High." But internal deployment is treated the same as development, with a very high capability threshold. The framework also makes no commitments about risk assessment during deployment or about what would make them modify or undeploy a deployed model, and they don't commit to particular safety practices (or control evaluations).
  • Structured access: 54%. OpenAI deploys its models via API.
  • Staged release: Yes.

Keeping capabilities research private

  • Policy against publishing capabilities research: No.
  • Keep research behind the lab's language models private: Yes.
  • Keep other capabilities research private: Yes.

Deployment protocol

  • Safety scaffolding: 0%.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.

OpenAI Risk assessment

25% Score 20% Weight

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 32%. OpenAI's not-yet-implemented Preparedness Framework will involve model evals for model autonomy, cybersecurity, CBRN, and persuasion.
  • Share how the lab does model evals and red-teaming: No.
  • Prepare to make "control arguments" for powerful models: No.
  • Give third parties access to do model evals: 25%. The Preparedness Framework calls for some external review, but not necessarily independent model evals.

Commitments

  • Commit to use risk assessment frequently enough: 50%. The Preparedness Framework says "We will be running these evaluations continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training. This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough." But it is unclear how evaluations could be run before training. More importantly, it's not clear whether "after training" includes during deployment. Additionally, the Preparedness Framework has not yet been implemented.

Accountability

  • Publish regular updates on risk assessments: 17%. Once the Preparedness Framework is implemented, there will be a monthly internal report and a public scorecard with very high-level results.
  • Have good processes for revising policies: 43%. The OpenAI board reviews and can veto changes.
  • Elicit external review of risk assessment practices: No.

OpenAI Training

55% Score 14% Weight
  • Filtering training data: No.
  • Scalable oversight: Yes. OpenAI has worked on this.
  • Adversarial training: Yes.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: 50%. This is part of OpenAI's Preparedness Framework, which has not yet been implemented. The PF says OpenAI would pause when it reaches the "critical" risk level if it has not achieved an (underspecified) safety desideratum. However, the "critical" risk threshold is very high; a model below the threshold could cause a catastrophe. Additionally, OpenAI may be required to share the dangerous model with Microsoft, which has no analogous commitments.

OpenAI Scalable alignment

0% Score 10% Weight
  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.

OpenAI Security*

10% Score 9% Weight

Certifications, audits, and pentests

  • Certification: 75%. OpenAI has security certifications but the reports are private.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: No.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: 25%. OpenAI's beta Preparedness Framework mentions improving security before reaching "high"-level risk, but it is not specific.

OpenAI Internal governance

33% Score 8% Weight

Organizational structure

  • Primary mandate is safety and benefit-sharing: Yes. OpenAI is controlled by a nonprofit. OpenAI's "mission is to ensure that artificial general intelligence benefits all of humanity," and its Charter mentions "Broadly distributed benefits" and "Long-term safety."
  • Good board: 68%. OpenAI is controlled by a nonprofit board.
  • Investors/shareholders have no formal power: Yes, OpenAI is controlled by a nonprofit.

Planning for pause

  • Detailed plan for pausing if necessary: No. OpenAI's Preparedness Framework says "we recognize that pausing deployment or development would be the last resort (but potentially necessary) option." OpenAI has not elaborated on what a pause would entail.

Leadership incentives

  • Executives have no financial interest: 25%. Executives have equity, except CEO Sam Altman, who has a small amount of equity indirectly.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: 13%. Not yet, but OpenAI says it is "Creating a whistleblower hotline to serve as an anonymous reporting resource for all OpenAI employees and contractors."
  • Commit to not use non-disparagement agreements: No.

OpenAI Alignment program

100% Score 6% Weight
Have an alignment research team and publish some alignment research: Yes.

OpenAI Alignment plan

75% Score 6% Weight
Have a plan to deal with misalignment: 75%. OpenAI plans to use scalable oversight to automate alignment research.

OpenAI Public statements

40% Score 3% Weight
  • Talk about and understand extreme risks from AI: 50%. OpenAI and its leadership sometimes talk about extreme risks and the alignment problem.
  • Clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it: No.

*We believe that the Security scores are very poor indicators of labs’ security. Security is hard to evaluate from outside the organization; few organizations say much about their security. But we believe that each security criterion corresponds to a good ask for labs. If you have suggestions for better criteria (or can point us to sources showing that our scoring is wrong, or want to convince us that some of our criteria should be removed), please let us know.

On scoring and weights, see here.
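
As a rough illustration of how category scores and weights combine, the sketch below computes a plain weighted mean of the nine category scores listed in the scorecard above. That the overall score is a simple weighted mean is an assumption made here for illustration; the linked scoring page is the authority, and the displayed overall figure may be derived from unrounded criterion-level scores or a different aggregation. The names `CATEGORY_SCORES` and `weighted_score` are hypothetical.

```python
# (score %, weight %) per category, copied from the scorecard above.
CATEGORY_SCORES = {
    "Deployment": (26, 24),
    "Risk assessment": (25, 20),
    "Training": (55, 14),
    "Scalable alignment": (0, 10),
    "Security": (10, 9),
    "Internal governance": (33, 8),
    "Alignment program": (100, 6),
    "Alignment plan": (75, 6),
    "Public statements": (40, 3),
}


def weighted_score(categories):
    """Weighted mean of category scores (assumed aggregation, not confirmed)."""
    total_weight = sum(weight for _, weight in categories.values())
    weighted_sum = sum(score * weight for score, weight in categories.values())
    return weighted_sum / total_weight


if __name__ == "__main__":
    print(f"Weighted mean of category scores: {weighted_score(CATEGORY_SCORES):.1f}%")
```

Dividing by the total weight keeps the function correct even if the weights passed in do not sum to exactly 100.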

Summary

OpenAI was founded in 2015 as a nonprofit. It is led by Sam Altman. In 2019, it created a capped-profit company in order to fundraise; the company is controlled by the nonprofit board.

OpenAI’s flagship model is GPT-4.

OpenAI’s Preparedness Framework describes its risk assessment plans and related commitments.

OpenAI seems to be obligated to share its models with Microsoft until it attains “AGI,” “a highly autonomous system that outperforms humans at most economically valuable work.” It has left details of its relationship with Microsoft opaque, including its model-sharing obligations.

OpenAI is aware that powerful AI entails extreme risks.1

Deployment

Labs should do risk assessment before deployment and avoid deploying dangerous models. They should release their systems narrowly, to maintain control over their systems, and with structured access. They should deploy their systems in scaffolding designed to improve safety by detecting and preventing misbehavior. They should deploy to boost safety research while avoiding boosting capabilities research on dangerous paths. More.

What OpenAI is doing

OpenAI deploys its most powerful models via API and its chatbot ChatGPT. It also shares its models with Microsoft.

Evaluation

We give OpenAI a score of 26% on deployment. They would score 34% except that they share their models with Microsoft, which deploys them less safely.

For more, including weighting between different criteria, see the Deployment page.

Deployment decision.

  • Commit to do pre-deployment risk assessment and not deploy models with particular dangerous capabilities (including internal deployment), at least until implementing particular safety practices or passing control evaluations. Score: No. They share their models with Microsoft, which has made no such commitments. Otherwise, they would score 50%: their Preparedness Framework says they won’t externally deploy a model with a post-mitigation risk score of “High.” But internal deployment is treated the same as development, with a very high capability threshold. And they don’t commit to particular safety practices (or control evaluations).
    • … and commit to do risk assessment during deployment, before pushing major changes and otherwise at least every 3 months (to account for improvements in fine-tuning, scaffolding, plugins, prompting, etc.), and commit to implement particular safety practices or partially undeploy dangerous models if risks appear. Score: No. OpenAI’s Preparedness Framework makes no commitments about risk assessment during deployment or what would make them modify or undeploy a deployed model.

Release method.

Structured access:

  • not releasing dangerous model weights (or code): the lab should deploy its most powerful models privately or release via API or similar, or at least have some specific risk-assessment-result that would make it stop releasing model weights. Score: Yes. OpenAI doesn’t release the weights of its powerful models. However, it has no published policy or commitments about this.
    • … and effectively avoid helping others create powerful models (via model inversion or imitation learning). It’s unclear what practices labs should implement, so for now we use the low bar of whether they say they do anything to prevent users from (1) determining model weights, (2) using model outputs to train other models, and (3) determining training data. Score: No. OpenAI does not say it does these things.
    • … and limit deep access to powerful models. It’s unclear what labs should do, so for now we check whether the lab disables or has any limitations on access to each of logprobs, embeddings at arbitrary layers, activations, and fine-tuning. Score: Yes. Users generally can’t access GPT-4 embeddings or activations. Logprobs are restricted to the top 5. Fine-tuning is not available to all users for GPT-4 (but fine-tuning GPT-3.5 is).
      • … and differential access: systematically give more access to safety researchers and auditors. Score: No.
    • … and terms of service: have some rules about model use, including rules aimed to prevent catastrophic misuse, model duplication, or otherwise using the model to train other models. Score: Yes. See the Terms of use.

Staged release:

  • Deploy narrowly at first; use narrow deployment to identify and fix issues. Score: Yes, mostly. When deploying GPT-4, OpenAI did not enable public fine-tuning. But it’s not clear how it restricted or monitored the fine-tuning that it did allow. Also note that OpenAI tested GPT-4 for months before releasing it, but the details are unclear.

Keeping capabilities research private.

We evaluate how well labs avoid diffusing dangerous capabilities research. We exclude model weights (and related artifacts like code); those are considered in “Releasing models.”

Publication policy/process (for LLM research)

  • Policy: the lab should say that it doesn’t publish dangerous or acceleratory research/artifacts and say it has a policy/process to ensure this. Score: No.
    • … and share policy details.

Track record for recent major LM projects: not sharing in practice (via publication or leaking—we can at least notice public leaks):

  • Architecture (unless it’s an unusually safe architecture). Score: Yes. But GPT-4’s high-level architecture seems to have leaked.
  • Dataset (except particular corpuses to include or exclude for safety). Score: Yes. Their LM training data is private.
  • Lessons learned. Score: Yes. They didn’t publish capabilities lessons on GPT-3.5 or GPT-4.

Track record for other LM capabilities research or other dangerous or acceleratory work: not sharing in practice. Evaluated holistically. Score: Yes, OpenAI seems to have published little LM capabilities research recently.

Safety scaffolding.

Filter out model inputs or outputs that enable misuse, in particular via cyberoffense and bioengineering. Ideally demonstrate that the protocol is very effective for averting misuse-enabling model outputs. Evaluated holistically. Score: No.

Supervise potentially dangerous models:

Embed untrusted models in safety scaffolding to redact/paraphrase some inputs or outputs with the goal of making it harder for the model to distinguish deployment from testing or collude with other models. Score: No.

Commitments:

  • Make specific commitments about deployment protocol safety techniques the lab will implement in the future, to be implemented by a certain time or as a function of model capabilities or risk assessment results. Score: No. In particular, OpenAI’s Preparedness Framework does not mention deployment protocol safety techniques.

Clearly describe the safety-relevant parts of the lab’s deployment protocol (given that it’s nontrivial). Score: No.

Respond to scheming

Commit and have a plan that if the lab catches its models scheming:

  • The lab will shut down some model access until verifying safety or fixing the issue. Score: No.
  • The lab will use particular good techniques to use that example of AIs attempting to cause a catastrophe to improve safety. Score: No.

Respond to misuse

Enforcement and KYC:

  • Sometimes remove some access from some users, and require nontrivial KYC for some types of access to make some enforcement effective. Score: No. OpenAI does not say it does this.

Inference-time model-level safety techniques

Prompting:

  • Generally use safety-focused prompting for the lab’s most powerful models. Score: No. OpenAI doesn’t discuss its system prompts.

Activation engineering:

  • Generally use safety-focused activation engineering for the lab’s most powerful models. Score: No. OpenAI doesn’t say anything about this.

Bug bounty & responsible disclosure.

(This is for model outputs, not security.)

  • Labs should have good channels for users to report issues with models. Score: 50%. Users can report issues in ChatGPT but OpenAI’s bug bounty and vulnerability disclosure policy are just for security issues.
    • … and have clear guidance on what issues they’re interested in reports on and what’s fine to publish. Score: No.
    • … and incentivize users to report issues. Score: No.

Respond to emergencies.

  • Labs should have the ability to shut everything down quickly plus a plan for what could trigger that and how to quickly determine what went wrong. Monitoring-to-trigger-shutdown doesn’t need to be implemented, but it should be ready to be implemented if necessary. Score: No.

Risk assessment

Labs should detect threats arising from their systems, in particular by measuring their systems’ dangerous capabilities. They should make commitments about how they plan to mitigate those threats. In particular, they should make commitments about their decisions (for training and deployment), safety practices (in controlling models and security), and goals or safety levels to achieve (in control and security) as a function of dangerous capabilities or other risk assessment results. More.

What OpenAI is doing

Risk assessment during training, and responding to that, is part of OpenAI’s (not yet implemented) Preparedness Framework. The basic framework is:

  1. Do dangerous capability evals at least every 2x increase in effective training compute. This involves fine-tuning for dangerous capabilities, then doing evals on pre-mitigation and post-mitigation versions of the fine-tuned model. Score the models as Low, Medium, High, or Critical in each of several categories. The initial categories are cybersecurity, CBRN (chemical, biological, radiological, nuclear threats), persuasion, and model autonomy.
  2. If the post-mitigation model scores High in any category, don’t deploy it until implementing mitigations such that it drops to Medium.
  3. If the pre-mitigation model scores Critical in any category, stop developing it until implementing mitigations such that it drops to High.
  4. If the pre-mitigation model scores High in any category, harden security to prevent exfiltration of model weights. (Details basically unspecified for now.)
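
Steps 2 through 4 above amount to a small decision rule over per-category scores. The following is a minimal sketch of that rule as summarized here, not OpenAI's implementation; the names `Risk`, `CATEGORIES`, and `framework_decisions` are hypothetical, and step 3 is read as "development may continue unless a pre-mitigation score is Critical and mitigations have not brought the post-mitigation score down to High or lower."

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# The initial tracked categories named in step 1.
CATEGORIES = ("cybersecurity", "cbrn", "persuasion", "model_autonomy")


def framework_decisions(pre, post):
    """Apply steps 2-4 of the summary to per-category risk scores.

    pre and post map each category name to a Risk level for the
    pre-mitigation and post-mitigation model, respectively.
    """
    worst_pre = max(pre[c] for c in CATEGORIES)
    worst_post = max(post[c] for c in CATEGORIES)
    return {
        # Step 2: deploy only if every post-mitigation score is Medium or lower.
        "may_deploy": worst_post <= Risk.MEDIUM,
        # Step 3 (one reading): a Critical pre-mitigation score halts development
        # unless mitigations bring the post-mitigation score to High or lower.
        "may_continue_development": (
            worst_pre < Risk.CRITICAL or worst_post <= Risk.HIGH
        ),
        # Step 4: harden security once any pre-mitigation score reaches High.
        "must_harden_security": worst_pre >= Risk.HIGH,
    }


if __name__ == "__main__":
    pre = {
        "cybersecurity": Risk.HIGH,
        "cbrn": Risk.MEDIUM,
        "persuasion": Risk.MEDIUM,
        "model_autonomy": Risk.LOW,
    }
    post = dict(pre, cybersecurity=Risk.MEDIUM)  # mitigations lower one score
    print(framework_decisions(pre, post))
```

Taking the maximum across categories reflects the "in any category" wording: a single High or Critical score is enough to trigger the corresponding step.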

The thresholds for risk levels feel high: although the definitions of High and Critical in some categories sound alarming, it seems an AI system could cause a catastrophe without reaching Critical in any category.

They say they made predictions about GPT-4’s capabilities during training, but this seems to mean benchmark performance, not dangerous capabilities.2 OpenAI says that forecasting threats and dangerous capabilities is part of this framework, but they’re light on details here. I think “Forecasting, ‘early warnings,’ and monitoring” is the only relevant section, and it’s very short.

They have partnered with METR; METR evaluated GPT-4 for autonomous replication capabilities before release. They haven’t committed to share access for model evals more generally. They say “Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties to ensure accurate reporting of results, either by reproducing findings or by reviewing methodology to ensure soundness, at a cadence specified by the SAG and/or upon the request of OpenAI Leadership or the BoD.”

While working on GPT-4, they “used GPT-4 to help create training data for model fine-tuning and iterate on classifiers across training, evaluations, and monitoring”; see also Using GPT-4 for content moderation. “Risks & mitigations” in “GPT-4 Technical Report” discusses “Adversarial Testing via Domain Experts” and “Model-Assisted Safety Pipeline”; see also “Model Mitigations” in “GPT-4 System Card.”

The OpenAI Red Teaming Network seems like a good idea; it is not yet clear how successful it is.

They wrote a “case study” in What is Red Teaming? (Frontier Model Forum 2023):

OpenAI Case Study: Expert Red Teaming for GPT-4

Background

In August 2022, OpenAI began recruiting external experts to red team and provide feedback on GPT-4. Red teaming has been applied to language models in various ways: to reduce harmful outputs and to leverage external expertise for domain-specific adversarial testing. OpenAI’s approach is to red team iteratively, starting with an initial hypothesis of which areas may be the highest risk, testing these areas, and adjusting as required. It is also iterative in the sense that OpenAI uses multiple rounds of red teaming as new layers of mitigation and control are incorporated.

Methodology

OpenAI recruited 41 researchers and industry professionals - primarily with expertise in fairness, alignment research, industry trust and safety, dis/misinformation, chemistry, biorisk, cybersecurity, nuclear risks, economics, human-computer interaction, law, education, and healthcare - to help gain a more robust understanding of the GPT-4 model and potential deployment risks. OpenAI selected these areas based on a number of factors including but not limited to: prior observed risks in language models and AI systems and domains where OpenAI has observed increased user interest in the application of language models. Participants in this red team process were chosen based on prior research or experience in these risk areas. These experts had access to early versions of GPT-4 and to the model with in-development mitigations. This allowed for testing of both the model and system level mitigations as they were developed and refined.

Outcomes

The external red teaming exercise identified initial risks that motivated safety research and further iterative testing in key areas. OpenAI reduced risk in many of the identified areas with a combination of technical mitigations, and policy and enforcement levers; however, some risks still remain. While this early qualitative red teaming exercise was very useful for gaining insights into complex, novel models like GPT-4, it is not a comprehensive evaluation of all possible risks, and OpenAI continues to learn more about these and other categories of risk over time. The results of the red teaming process were summarized and published in the GPT-4 System Card.

See also Lessons learned on language model safety and misuse (OpenAI 2022).

Evaluation

We give OpenAI a score of 25% on risk assessment.

For more, including weighting between different criteria, see the Risk assessment page.

Measuring threats.

  1. Do risk assessment before training. Before building a frontier model, predict model capabilities (in terms of benchmarks and real-world applications, especially dangerous capabilities) and predict the real-world consequences of developing and deploying the model. Score: No. OpenAI hasn’t written about this.
  2. Do model evals for dangerous capabilities before deployment:
    • Say what dangerous capabilities the lab watches for (given that it does so at all). Score: 50%. OpenAI commits to do model evals for model autonomy, cybersecurity, CBRN, and persuasion. But it has not yet implemented its Preparedness Framework. Additionally, its risk threshold relevant to internal deployment is very high.
      • … and watch for autonomous replication, coding (finding/exploiting vulnerabilities in code, writing malicious code, or writing code with hidden vulnerabilities), and situational awareness or long-horizon planning. (Largely this is an open problem that labs should solve.) Score: 33%. OpenAI commits to do model evals for model autonomy and cybersecurity. But it has not yet created those evals.
      • … and detail the specific tasks the lab uses in its evaluations. (Omit dangerous details if relevant.) Score: 25%. No, but the Preparedness Framework gives some possible tasks for illustration.
  3. Explain the details of how the lab evaluates performance on the tasks it uses in model evals and how it does red-teaming (excluding dangerous or acceleratory details). In particular, explain its choices about fine-tuning, scaffolding/plugins, prompting, how to iterate on prompts, and whether the red-team gets a fixed amount of person-hours and compute or how else they decide when to give up on eliciting a capability. And those details should be good. Evaluated holistically. Score: No. OpenAI doesn’t share details, but its Preparedness Framework does say “fine-tuning or other domain-specific enhancements (e.g., tailored prompts or language model programs) may better elicit model capabilities along a particular risk category. Our evaluations will thus include tests against these enhanced models to ensure we are testing against the ‘worst case’ scenario we know of” and “We want to ensure our understanding of pre-mitigation risk takes into account a model that is ‘worst known case’ (i.e., specifically tailored) for the given domain. To this end, for our evaluations, we will be running them not only on base models (with highly-performant, tailored prompts wherever appropriate), but also on fine-tuned versions designed for the particular misuse vector without any mitigations in place.”
  4. Prepare to have control arguments for the lab’s powerful models, i.e. arguments that those systems cannot cause a catastrophe even if the systems are scheming. And publish this. For now, the lab should:
    • Prepare to do risk assessment to determine whether its systems would be dangerous, if those systems were scheming. Score: No.
    • Test its AI systems to ensure that they report coup attempts (or other misbehavior) by themselves or other (instances of) AI systems, and that they almost never initiate or cooperate with coup attempts. Score: No.
  5. Give some third parties access to models to do model evals for dangerous capabilities. This access should include fine-tuning and tools/plugins. It should occur both during training and between training and deployment. It should include base models rather than just safety-tuned models, unless the lab can demonstrate that the safety-tuning is robust. The third parties should have independence and control over their evaluation; just using external red-teamers is insufficient. The third parties should have expertise in eliciting model capabilities (but the lab should also offer assistance in this) and in particular subjects if relevant. The lab should incorporate the results into its risk assessment. Score: 25%. OpenAI gave METR early access to GPT-4 to do autonomous replication evals. They haven’t committed to share access for model evals more generally. They say “Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties to ensure accurate reporting of results, either by reproducing findings or by reviewing methodology to ensure soundness, at a cadence specified by the SAG and/or upon the request of OpenAI Leadership or the BoD.” Third-party evaluation should include more open-ended red-teaming or evals determined by the third party.

Commitments.

  • Commit to use risk assessment frequently enough. Do risk assessment (for dangerous capabilities) regularly during training, before deployment, and during deployment (and commit to doing so), so that the lab will detect warning signs before dangerous capabilities appear. Score: 50%. The Preparedness Framework says “We will be running these evaluations continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training. This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough.” But it is unclear how evaluations could be run before training. More importantly, it’s not clear whether “after training” includes during deployment. Additionally, the Preparedness Framework has not yet been implemented.
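
The compute-based part of the quoted trigger can be sketched as a simple rule over training checkpoints. This is an illustration under stated assumptions, not the framework's actual procedure: the function name `eval_checkpoints` and its inputs are hypothetical, the check uses "at least 2x" (slightly more conservative than the quoted ">2x"), and it ignores the "major algorithmic breakthrough" trigger and anything that happens during deployment.

```python
def eval_checkpoints(effective_compute, factor=2.0):
    """Return indices of checkpoints at which an eval would be triggered.

    effective_compute is an increasing sequence of effective-compute values,
    one per checkpoint, in any consistent unit. An eval runs at the first
    checkpoint, then whenever compute has grown by at least `factor` since
    the last evaluated checkpoint.
    """
    triggered = []
    last_evaluated = None
    for index, compute in enumerate(effective_compute):
        if last_evaluated is None or compute >= factor * last_evaluated:
            triggered.append(index)
            last_evaluated = compute
    return triggered


if __name__ == "__main__":
    checkpoints = [1.0, 1.5, 2.0, 3.0, 4.5, 9.0]
    print(eval_checkpoints(checkpoints))
```

Measuring growth from the last evaluated checkpoint, rather than from the previous checkpoint, means that many small increases cannot accumulate past 2x without an eval being run.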

Accountability.

  • Verification: publish updates on risk assessment practices and results, including low-level details, at least quarterly. Score: 17%: once the Preparedness Framework is implemented, there will be a monthly internal report and a public scorecard with very high-level results.
  • Revising policies:
    • Avoid bad changes: a nonprofit board or other somewhat-independent group with a safety mandate should have veto power on changes to risk assessment practices and corresponding commitments, at the least. And “changes should be clearly and widely announced to stakeholders, and there should be an opportunity for critique.” As an exception, “For minor and/or urgent changes, [labs] may adopt changes to [their policies] prior to review. In these cases, [they should] require changes . . . to be approved by a supermajority of [the] board. Review still takes place . . . after the fact.” Key Components of an RSP (METR 2023). Score: 75%. The Preparedness Framework says the OpenAI board will review changes and have veto power. However, there is no commitment to share changes.
    • Promote good changes: have a process for staff and external stakeholders to share concerns about risk assessment policies or their implementation with the board and some other staff, including anonymously. Score: No.
  • Elicit external review of risk assessment practices and commitments. Publish those reviews, with light redaction if necessary. Score: No.

Training

Frontier AI labs should design and modify their systems to be less dangerous and more controllable. For example, labs can:

  • Filter training data to prevent models from acquiring dangerous capabilities and properties
  • Use a robust training signal even on tasks where performance is hard to evaluate
  • Reduce undesired behavior, especially in high-stakes situations, via adversarial training

More.

What OpenAI is doing

They work on scalable oversight and plan to use scalable oversight to align superhuman AI. They used red-teaming and adversarial training for GPT-4. They use RLHF and fine-tuning for safety. They have made commitments that could require them to pause training for safety, but the threshold is very high and the safety measures they commit to implement are unclear.

Evaluation

We give OpenAI a score of 55% on training.

For more, including weighting between different criteria, see the Training page.

Filtering training data.

  • Filter training data to reduce extreme risks, at least including text on biological weapons; hacking; and AI risk, safety, and evaluation. Share details to help improve others’ safety filtering. Score: No.

Training signal & scalable oversight.

  • Work on scalable oversight. Score: Yes. They work on scalable oversight and plan to use scalable oversight to align superhuman AI.

Adversarial training & red-teaming.

  • Do some adversarial training for safety. Score: Yes. They used red-teaming and adversarial training for GPT-4.

Unlearning.

  • Use unlearning for dangerous topics including biorisks; hacking; and AI risk, safety, and evaluation. Also demonstrate that this technique is successful and commit to use this technique for future powerful models. Score: No.

RLHF and fine-tuning.

  • Use RLHF (or similar) and fine-tuning to improve honesty and harmlessness, for all of the near-frontier models the lab deploys. Score: Yes.

Commitments.

  • Do risk assessment during training and commit that some specific trigger would cause the lab to pause training. Score: 50%: this is part of OpenAI’s Preparedness Framework, which has not yet been implemented. The PF says that if they “reach (or are forecasted to reach) ‘critical’ pre-mitigation risk along any risk category,” they will implement mitigations and require “dependable evidence that the model is sufficiently aligned that it does not initiate ‘critical’-risk-level tasks unless explicitly instructed to do so” before continuing development, except “safety-enhancing development.” However, the ‘critical’ risk threshold is very high; a model below the threshold could cause a catastrophe. Additionally, OpenAI may be required to share the dangerous model with Microsoft, which has no analogous commitments.

Scalable alignment

Labs should be able to understand and control their models and systems built from those models. More.

What OpenAI is doing

Like all other labs, OpenAI hasn’t achieved any real alignment properties.

Evaluation

We give OpenAI a score of 0% on scalable alignment.

  • Demonstrate that if the lab’s systems were more capable, they would not be misaligned powerseekers (due to instrumental pressures or because ML will find influence-seeking policies by default). Score: No.
  • Be able to interpret the lab’s most powerful systems:
    • Be able to detect cognition involving arbitrary topics. Score: No.
    • Be able to detect manipulation/deception. Score: No.
    • Be able to elicit latent knowledge. Score: No.
    • Be able to elicit ontology. Score: No.
    • Be able to elicit true goals. Score: No.
    • Be able to elicit faithful chain-of-thought. Score: No.
    • Be able to explain hallucinations mechanistically. Score: No.
    • Be able to explain jailbreaks mechanistically. Score: No.
    • Be able to explain refusals mechanistically. Score: No.
    • Be able to remove information or concepts (on particular topics). Score: No.
  • Be able to prevent or detect deceptive alignment. (Half credit for being able to prevent or detect gradient hacking.) Score: No.

Security

Labs should ensure that they do not leak model weights, code, or research. If they do, other actors could unsafely deploy near-copies of a lab’s models. Achieving great security is very challenging; by default powerful actors can probably exfiltrate vital information from AI labs. Powerful actors will likely want to steal from labs developing critical systems, so those labs will likely need excellent cybersecurity and operational security. More.

What OpenAI is doing

The OpenAI Security Portal is good. They have SOC 2 Type II certification. Reports from their audits and pentests are not public. They have mostly not committed to particular security best practices. The practices mentioned in “Security controls including securing model weights” in “OpenAI’s Approach to Frontier Risk” (OpenAI 2023) are better than nothing but inadequate.

OpenAI’s ChatGPT had a security breach in March 2023. OpenAI is the subject of an FTC investigation related in part to user data security.

The OpenAI cybersecurity grant program (OpenAI 2023) supports AI for security; this is not relevant to OpenAI’s security of model weights.

OpenAI seems to be required to share its models with Microsoft until OpenAI attains “a highly autonomous system that outperforms humans at most economically valuable work.” So the security of OpenAI’s weights is the security of OpenAI or Microsoft, whichever is worse.

Evaluation

We give OpenAI a score of 10% on security.

We evaluate labs’ security based on the certifications they have earned, whether they say they use some specific best practices, and their track record. For more, including weighting between different criteria, see the Security page.

Certifications, audits, and pentests.

  • Publish SOC 2, SOC 3, or ISO/IEC 27001 certification, including any corresponding report (redacting any sensitive details), for relevant products (3/4 credit for certification with no report). Score: 75%. OpenAI displays their API’s SOC 2 Type II compliance on their security portal, but the report isn’t public. It used to publish an SOC 3 report but doesn’t anymore.
  • Pentest. Publish pentest results (redacting sensitive details but not the overall evaluation). Scoring is holistic, based on pentest performance and the quality of the pentest. Score: No. OpenAI’s pentest reports are private.

Specific best practices.

  • Keep source code exclusively in a hardened cloud environment. Score: No. OpenAI does not say it does this.
  • Use multiparty access controls for model weights and some code. Score: No.
  • Limit uploads from clusters with model weights. Score: No.

Track record.

  • Establish and publish a breach disclosure policy, ideally including incident or near-miss reporting. Also report all breaches since 1/1/2022 (and say the lab has done so). Score: No. OpenAI has not published a breach disclosure policy.
    • … and track record: have few serious breaches and near misses. Evaluated holistically.

Commitments.

  • Commit to achieve specific security levels (as measured by audits or security-techniques-implemented) before creating models beyond corresponding risk thresholds (especially as measured by model evals for dangerous capabilities). Score: 25%. OpenAI’s Preparedness Framework says “If we reach (or are forecasted to reach) at least ‘high’ pre-mitigation risk in any of the considered categories: we will ensure that our security is hardened in a way that is designed to prevent our mitigations and controls from being circumvented via exfiltration (by the time we hit ‘high’ pre-mitigation risk). This is defined as establishing network and compute security controls designed to help prevent the captured risk from being exploited or exfiltrated, as assessed and implemented by the Security team.” There are almost no details about these future security controls, nor discussion of preventing exfiltration by particular kinds of actors or at a particular cost.

Alignment plan

Labs should make a plan for alignment, and they should publish it to elicit feedback, inform others’ plans and research (especially other labs and external alignment researchers who can support or complement their plan), and help them notice and respond to information when their plan needs to change. They should omit dangerous details if those exist. As their understanding of AI risk and safety techniques improves, they should update the plan. Sharing also enables outsiders to evaluate the lab’s attitudes on AI risk/safety. More.

What OpenAI is doing

They basically have a plan. The best statement of that plan is “How weak-to-strong generalization fits into alignment” in “Weak-to-strong generalization” (OpenAI 2023):

  1. Once we have a model that is capable enough that it can automate machine learning—and in particular alignment—research, our goal will be to align that model well enough that it can safely and productively automate alignment research.
  2. We will align this model using our most scalable techniques available, e.g. RLHF, constitutional AI, scalable oversight, adversarial training, or—the focus of this paper—weak-to-strong generalization techniques.
  3. We will validate that the resulting model is aligned using our best evaluation tools available, e.g. red-teaming and interpretability.
  4. Using a large amount of compute, we will have the resulting model conduct research to align vastly smarter superhuman systems. We will bootstrap from here to align arbitrarily more capable systems.

(Citations omitted.) Ideally the plan would be complete, concrete, and written in one place; as is, it may not be sharp enough to permit noticing if it’s misguided or not-on-track.

Evaluation

We give OpenAI a score of 75% on alignment plan. More.

  • The safety team should share a plan for misalignment, including for the possibility that alignment is very difficult. Score: Yes, basically.
    • … and the lab should have a plan, not just its safety team. Score: Yes.
      • … and the lab’s plan should be sufficiently precise that it’s possible to tell whether the lab is working on it, whether it’s succeeding, and whether its assumptions have been falsified. Score: No.
      • … and the lab should share its thinking on how it will revise its plan and invite and publish external scrutiny of its plan. Score: No. OpenAI hasn’t said anything about revision or external scrutiny.

Internal governance

Labs should have a governance structure and processes to promote safety and help make important decisions well. More.

What OpenAI is doing

OpenAI has an unusual structure: a nonprofit controls a capped profit company. It’s not clear how the cap works or where it is set. In 2019 they said “Returns for our first round of investors are capped at 100x their investment.” In 2021 Altman said the cap was “single digits now.” But reportedly the cap will increase by 20% per year starting in 2025 (The Information 2023; The Economist 2023).

OpenAI has a strong partnership with Microsoft. The details are opaque. It seems that OpenAI is required to share its models (and some other IP) with Microsoft until OpenAI attains “a highly autonomous system that outperforms humans at most economically valuable work.” This is concerning because AI systems could cause a catastrophe before reaching that threshold. Moreover, OpenAI may substantially depend on Microsoft; in particular, Microsoft Azure is “OpenAI’s exclusive cloud provider.” Microsoft’s power over OpenAI may make it harder for OpenAI to make tradeoffs that prioritize safety or to keep dangerous systems from Microsoft.

OpenAI has 7 voting board members: Sam Altman, Bret Taylor, Adam D’Angelo, Larry Summers, Sue Desmond-Hellmann, Nicole Seligman, and Fidji Simo.

D’Angelo has no equity in OpenAI; presumably the other board members (besides Altman) don’t either.3 This seems slightly good for incentives, but most of the incentive for an AI lab’s leaders to want their lab to succeed comes not from their equity but from the power they gain if it succeeds. The former board removed CEO Sam Altman and President Greg Brockman from the board and removed Altman from OpenAI. (Altman soon returned.) The move by Sutskever, D’Angelo, McCauley, and Toner seems to have been precipitated by some combination of

  • Sutskever4 and other OpenAI executives telling them that Altman often lied
  • Altman dishonestly attempting to remove Toner from the board (over the pretext that her coauthored paper Decoding Intentions was too critical of OpenAI, plus allegedly falsely telling board members that McCauley wanted Toner removed).5

The OpenAI nonprofit controls the company and has no fiduciary duty to investors. However, OpenAI may be strategically ambiguous about this. Former board member Helen Toner says an OpenAI lawyer incorrectly told the board that it had a fiduciary duty.

The OpenAI Charter mentions “Broadly distributed benefits” and “Long-term safety,” including the ‘stop and assist clause.’

OpenAI seems to have developed a culture in which publicly criticizing OpenAI is considered bad.6

Evaluation

We give OpenAI a score of 33% on internal governance.

For more, including weighting between different criteria, see the Internal governance page.

Organizational structure: the lab is structured in a way that enables it to prioritize safety in key decisions, legally and practically.

  • The lab and its leadership have a mandate for safety and benefit-sharing and have no legal duty to create shareholder value. Score: Yes. OpenAI is a capped profit company controlled by a nonprofit. OpenAI’s “mission is to ensure that artificial general intelligence benefits all of humanity,” and its Charter mentions “Broadly distributed benefits” and “Long-term safety.”
  • There is a board that can effectively oversee the lab, and it is independent:
    • There is a board with ultimate formal power, and its main mandate is for safety and benefit-sharing. Score: Yes.
      • … and it actually provides effective oversight. Score: No. OpenAI has not said anything about this.
      • … and it is independent (i.e. its members have no connection to the company or profit incentive) (full credit for fully independent, no credit for half independent, partial credit for in between). Score: 71%. Six of the seven board members are independent.
      • … and the organization keeps it well-informed. Score: 25%. The Preparedness Framework says “The [board] may review certain decisions taken and will receive appropriate documentation (i.e., without needing to proactively ask) to ensure the [board] is fully informed and able to fulfill its oversight role.” This is good but insufficiently precise. OpenAI has not said anything about this, except in the context of its Preparedness Framework. Altman has tried to limit the board’s insight into the company; he once told an independent board member to tell him if she spoke to employees.
      • … and it has formal powers related to risk assessment, training, or deployment decisions. Score: Yes. Per OpenAI’s Preparedness Framework, the board is in the loop regarding implementation of the framework and has some ability to overrule leadership. Additionally, the board decides when OpenAI has attained AGI, but that’s not much of a power, and it theoretically controls the development and deployment of AGI, but that’s insufficiently concrete to be effective.
  • Investors/shareholders have no formal power. Score: Yes.
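
The 71% independence score above can be reproduced with a short calculation. This is a minimal sketch, assuming that "partial credit for in between" means linear interpolation from no credit at half independent to full credit at fully independent; the page does not state the exact formula, and the function name `independence_credit` is hypothetical.

```python
def independence_credit(independent: int, total: int) -> float:
    """Credit for board independence.

    Assumption (not stated in the source): credit rises linearly from 0
    when half of the board is independent to 1 when all of it is.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = independent / total
    # Rescale the 50%..100% range to 0..1 and clamp anything outside it.
    return max(0.0, min(1.0, (fraction - 0.5) / 0.5))


# Six of seven board members are independent.
credit = independence_credit(6, 7)
print(f"{credit:.0%}")
```

Under this assumption, six of seven independent members gives (6/7 − 0.5) / 0.5 ≈ 0.71, which matches the reported score.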

Planning for pause. Have a plan for the details of what the lab would do if it needed to pause for safety and publish relevant details. In particular, explain what the lab’s capabilities researchers would work on during a pause, and say that the lab stays financially prepared for a one-year pause. Note that in this context “pause” can include pausing development of dangerous capabilities and internal deployment, not just public releases. Score: No. The Preparedness Framework says “we recognize that pausing deployment or development would be the last resort (but potentially necessary) option.” OpenAI has not elaborated on what a pause would entail.

Leadership incentives: the lab’s leadership is incentivized in a way that helps it prioritize safety in key decisions.

  • The CEO has no equity (or other financial interest in the company). Score: 50%. Altman has no direct equity. He has a financial interest via Y Combinator; he has not said how big it is, but presumably it’s small. However, he seems to have unusual other financial conflicts of interest; in particular, he seems to have sought investments and made investments in companies that would deal with OpenAI.
  • Other executives have no equity. Score: No.

Ombuds/whistleblowing/trustworthiness:

  • The lab has a reasonable process for staff to escalate concerns about safety. If this process involves an ombud, the ombud should transparently be independent or safety-focused. Evaluated holistically. Score: 13%: not yet, but OpenAI says it is “Creating a whistleblower hotline to serve as an anonymous reporting resource for all OpenAI employees and contractors.”
  • The lab promises that it does not use non-disparagement agreements (nor otherwise discourage current or past staff or board members from talking candidly about their impressions of and experiences with the lab). Score: No.

Alignment program

Labs should do and share alignment research as a public good, to help make powerful AI safer even if it’s developed by another lab. More.

What OpenAI is doing

OpenAI publishes real alignment research. See their Safety & Alignment research. In particular, real alignment research includes:

OpenAI also has supported external alignment research, especially via Superalignment Fast Grants.

Evaluation

We give OpenAI a score of 100% on alignment program.

We simply check whether labs publish alignment research. (This is crude; legibly measuring the value of alignment research is hard.)

  • Have an alignment research team and publish some alignment research. Score: Yes.

Public statements

Labs and their leadership should be aware of AI risk, that AI safety might be really hard, and that risks might be hard to notice. More.

What OpenAI is doing

OpenAI and its CEO Sam Altman sometimes talk about extreme risks and the alignment problem.7 Altman, CTO Mira Murati, Chief Scientist Ilya Sutskever, cofounders John Schulman and Wojciech Zaremba, and board member Adam D’Angelo signed the CAIS letter. OpenAI also makes plans and does research aimed at preventing extreme risks and solving the alignment problem; see our Alignment plan and Alignment research pages.

Evaluation

We give OpenAI a score of 40% on public statements. More.

  • The lab and its leadership understand extreme misuse or structural risks. Score: Yes.
    • … and they understand misalignment, that AI safety might be really hard, that risks might be hard to notice, that powerful capabilities might appear suddenly, and why they might need an alignment plan, and they talk about all this. Score: Yes.
      • … and they talk about it often/consistently. Score: No.
        • … and they consistently emphasize extreme risks.
  • Clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it. Score: No.
  1. See Introducing Superalignment (OpenAI: Leike and Sutskever 2023), Planning for AGI and beyond (OpenAI: Altman 2023), and Governance of superintelligence (OpenAI: Altman et al. 2023). 

  2. “Predictable Scaling” in “GPT-4 Technical Report” (OpenAI 2023). 

  3. See Our structure (OpenAI 2023). Altman has no direct equity; he has an unspecified “small” (and capped) amount of indirect equity. 

  4. Denied by Sutskever’s lawyer. 

  5. New Yorker; New York Times

  6. For example, see the Toner-Altman event (1, 2). 

  7. See e.g.:

    • Introducing Superalignment (OpenAI: Leike and Sutskever 2023)
    • Planning for AGI and beyond (OpenAI: Altman 2023)

    Altman is more concerned about misuse than misalignment; see e.g.:

    • Altman interview with StrictlyVC (2023):

      “And the bad case—and I think this is important to say—is lights out for all of us. I’m more worried about an accidental misuse case in the short term where someone gets a super powerful– it’s not like the AI wakes up and decides to be evil. I think all of the traditional AI safety thinkers reveal a lot more about themselves than they mean to when they talk about what they think the AGI is going to be like. But I can see the accidental misuse case clearly and that’s super bad. So I think it’s impossible to overstate the importance of AI safety and alignment work. I would like to see much much more happening. But I think it’s more subtle than most people think. You hear a lot of people talk about AI capabilities and AI alignment as orthogonal vectors– you’re bad if you’re a capabilities researcher and you’re good if you’re an alignment researcher. It actually sounds very reasonable, but they’re almost the same thing. Deep learning is just going to solve all of these problems and so far that’s what the progress has been. And progress on capabilities is also what has let us make the systems safer and vice versa surprisingly. So I think [] none of the soundbite easy answers work.”

    • Altman interview with Bloomberg (2023):

      You signed a 22-word statement warning about the dangers of AI. It reads: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.” Connect the dots for us here. How do we get from a cool chatbot to the end of humanity?

      Well, we’re planning not to.

      That’s the hope– but there’s also the fear?

      I think there’s many ways it could go wrong. We work with powerful technology that can be used in dangerous ways very frequently in the world. And I think we’ve developed, over the decades, good safety system practices in many categories. It’s not perfect, and this won’t be perfect either. Things will go wrong, but I think we’ll be able to mitigate some of the worst scenarios you could imagine. You know, bioterror is a common example; cybersecurity is another. There are many more we could talk about. But the main thing that I feel is important about this technology is that we are on an exponential curve, and a relatively steep one. And human intuition for exponential curves is really bad in general. It clearly was not that important in our evolutionary history. So given that we all have that weakness I think we have to really push ourselves to say– OK, GPT-4, not a risk like you’re talking about there, but how sure are we that GPT-9 won’t be? And if it might be—even if there’s a small percentage chance of it being really bad—that deserves great care.

    But see his old blogposts concerned with misalignment, in particular Machine intelligence, part 1 (Altman 2015).