
We collect actions for frontier AI labs to avert extreme risks from AI, then evaluate particular labs accordingly.

These criteria are not exhaustive or sufficient: they’re the best we’ve got given what we currently know, given that they have to be concrete, and given that they have to be evaluated based on public information. Some important variables are hard to measure, especially from outside the labs.

This website is a low-key research preview. We’re sharing it to elicit feedback and help determine which direction to take it. Although it’s in beta, we endorse the content (modulo a few criteria where it’s hard to find sources) and you are welcome to share and discuss it publicly. Leave feedback here.

Deployment

Avoid dangers in deployment, and boost safety but avoid boosting dangerous research

Microsoft Deployment

Score: 3% · Weight: 24%

Releasing models well

  • Deploy based on risk assessment results: No.
  • Structured access: No. Microsoft releases its model weights, and it has not published a capability/risk threshold past which it would stop releasing them, nor any other plan for noticing whether a release would be dangerous.
  • Staged release: 25%.

Keeping capabilities research private

  • Policy against publishing capabilities research: No.
  • Keep research behind the lab's language models private: 8%.
  • Keep other capabilities research private: No.

Deployment protocol

  • Safety scaffolding: 1%. Microsoft uses content filtering, but not to avert misuse; it does not use other safety scaffolding techniques; and it releases its model weights anyway.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.
DeepMind Deployment

Score: 24% · Weight: 24%

Releasing models well

  • Deploy based on risk assessment results: 25%. DeepMind's Frontier Safety Framework is good but insufficient.
  • Structured access: 54%. DeepMind deploys its most powerful models via API.
  • Staged release: Yes.

Keeping capabilities research private

  • Policy against publishing capabilities research: No.
  • Keep research behind the lab's language models private: 25%. DeepMind publishes most of this research, including on its Gemini models.
  • Keep other capabilities research private: No.

Deployment protocol

  • Safety scaffolding: 4%. DeepMind uses filters but not to avert misuse, and it does not use other safety scaffolding techniques.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.
Meta Deployment

Score: 4% · Weight: 24%

Releasing models well

  • Deploy based on risk assessment results: No.
  • Structured access: No. Meta AI releases its model weights, and it has not published a capability/risk threshold past which it would stop releasing them, nor any other plan for noticing whether a release would be dangerous.
  • Staged release: No.

Keeping capabilities research private

  • Policy against publishing capabilities research: No.
  • Keep research behind the lab's language models private: 25%. Meta AI publishes most of this research, but omits details about training data.
  • Keep other capabilities research private: No.

Deployment protocol

  • Safety scaffolding: 1%. Meta uses filtering in its chatbot, Meta AI, but not to avert misuse; it does not use other safety scaffolding techniques; and it releases its model weights anyway.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.
OpenAI Deployment

Score: 26% · Weight: 24%

Releasing models well

  • Deploy based on risk assessment results: No, since OpenAI shares its models with Microsoft, which has made no such commitments. Otherwise 33%: the Preparedness Framework says OpenAI won't externally deploy a model with a post-mitigation risk score of "High," but internal deployment is treated the same as development, with a very high capability threshold; the framework makes no commitments about risk assessment during deployment or about what would make OpenAI modify or undeploy a deployed model; and OpenAI does not commit to particular safety practices (or control evaluations). See the sketch after this list.
  • Structured access: 54%. OpenAI deploys its models via API.
  • Staged release: Yes.
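To make the gating described above concrete, here is a minimal sketch in Python (not OpenAI's implementation; the risk levels come from the Preparedness Framework as described above, while the function names and structure are our own simplification):

    # Hypothetical sketch of the Preparedness Framework's deployment gate,
    # as characterized in the bullet above; not OpenAI's actual logic.
    from enum import IntEnum

    class RiskLevel(IntEnum):
        LOW = 0
        MEDIUM = 1
        HIGH = 2
        CRITICAL = 3

    def may_deploy_externally(post_mitigation_risk: RiskLevel) -> bool:
        # External deployment is ruled out at a post-mitigation score of "High" or above.
        return post_mitigation_risk < RiskLevel.HIGH

    def may_deploy_internally(post_mitigation_risk: RiskLevel) -> bool:
        # Internal deployment is treated like development, so only "Critical" blocks it.
        return post_mitigation_risk < RiskLevel.CRITICAL

On this description, a model blocked from external deployment can still be used internally, which is the gap noted above.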

Keeping capabilities research private

  • Policy against publishing capabilities research: No.
  • Keep research behind the lab's language models private: Yes.
  • Keep other capabilities research private: Yes.

Deployment protocol

  • Safety scaffolding: 0%.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.
Anthropic Deployment

Score: 61% · Weight: 24%

Releasing models well

  • Deploy based on risk assessment results: 92%. Risk assessment results affect deployment decisions under Anthropic's Responsible Scaling Policy. The RSP also mentions "de-deployment" if Anthropic realizes that a model is much more capable than it believed, but it offers little detail and no firm commitments. And the RSP currently contains a "drafting error" suggesting that risk assessment does not occur during deployment.
  • Structured access: 54%. Anthropic deploys its models via API.
  • Staged release: Yes.

Keeping capabilities research private

  • Policy against publishing capabilities research: 67%. They have a policy but haven't shared the details.
  • Keep research behind the lab's language models private: Yes.
  • Keep other capabilities research private: Yes.

Deployment protocol

  • Safety scaffolding: 28%. Anthropic uses filtering and has made commitments to use safety scaffolding techniques in the future.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.
Risk assessment

Predict models' dangerous capabilities and check for warning signs for those capabilities

Microsoft Risk assessment

Score: 1% · Weight: 20%

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 3%. Microsoft says "red team testing will include testing of dangerous capabilities, including related to biosecurity and cybersecurity," but it doesn't share details, have model evals, or test for autonomous replication or situational awareness capabilities.
  • Share how the lab does model evals and red-teaming: No.
  • Prepare to make "control arguments" for powerful models: No.
  • Give third parties access to do model evals: No.

Commitments

  • Commit to use risk assessment frequently enough: No.

Accountability

  • Publish regular updates on risk assessments: No.
  • Have good processes for revising policies: No.
  • Elicit external review of risk assessment practices: No.
DeepMind Risk assessment

Score: 37% · Weight: 20%

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 93%. DeepMind has good model evals, does them somewhat regularly, and shares relevant details.
  • Share how the lab does model evals and red-teaming: 50%.
  • Prepare to make "control arguments" for powerful models: 13%.
  • Give third parties access to do model evals: 10%. DeepMind shared Gemini Ultra with external groups before deployment, but it didn't give them deep access. DeepMind has not shared results of this testing.

Commitments

  • Commit to use risk assessment frequently enough: 25%.

Accountability

  • Publish regular updates on risk assessments: 25%.
  • Have good processes for revising policies: No.
  • Elicit external review of risk assessment practices: No.
Meta Risk assessment

Score: 8% · Weight: 20%

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 33%. Meta AI tested Llama 3 for some hacking capabilities and open-sourced this evaluation.
  • Share how the lab does model evals and red-teaming: No.
  • Prepare to make "control arguments" for powerful models: No.
  • Give third parties access to do model evals: No.

Commitments

  • Commit to use risk assessment frequently enough: No.

Accountability

  • Publish regular updates on risk assessments: No.
  • Have good processes for revising policies: No.
  • Elicit external review of risk assessment practices: No.
OpenAI Risk assessment

Score: 45% · Weight: 20%

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 58%. OpenAI's Preparedness Framework involves model evals for model autonomy, cybersecurity, CBRN, and persuasion.
  • Share how the lab does model evals and red-teaming: No.
  • Prepare to make "control arguments" for powerful models: No.
  • Give third parties access to do model evals: 25%. OpenAI shared access to o1-preview with several external evaluators, but some details are bad or unclear. The Preparedness Framework calls for some external review, but not necessarily independent model evals.

Commitments

  • Commit to use risk assessment frequently enough: 75%. The Preparedness Framework says "We will be running these evaluations continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training. This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough." But we have no idea how they can run evaluations before training. More importantly, it's not clear whether "after training" includes during deployment. Additionally, the Preparedness Framework has not yet been implemented.

Accountability

  • Publish regular updates on risk assessments: 50%. There is supposedly a monthly internal report. There is a public scorecard with very high-level results. OpenAI generally publishes system cards alongside major model releases.
  • Have good processes for revising policies: 43%. The OpenAI board reviews and can veto changes.
  • Elicit external review of risk assessment practices: No.
Anthropic Risk assessment

Score: 47% · Weight: 20%

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 65%. Anthropic does model evals for autonomous replication and adaptation, cyber capabilities, and biology capabilities.
  • Share how the lab does model evals and red-teaming: 25%.
  • Prepare to make "control arguments" for powerful models: No.
  • Give third parties access to do model evals: 25%. Anthropic shared pre-deployment access for the new Claude 3.5 Sonnet with US AISI, UK AISI, and METR. The depth of access is unclear. It has made some commitment to US AISI but its future plans are unclear.

Commitments

  • Commit to use risk assessment frequently enough: Yes. Anthropic commits to do risk assessment every 4x increase in effective training compute and every 3 months.
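As an illustration of that cadence, a minimal sketch in Python (not Anthropic's tooling; only the 4x and roughly-quarterly figures come from the commitment above, and the names are hypothetical):

    # Hypothetical sketch of an eval-scheduling trigger; not Anthropic code.
    from datetime import datetime, timedelta

    COMPUTE_FACTOR = 4.0               # re-assess at every 4x increase in effective training compute
    MAX_INTERVAL = timedelta(days=90)  # and at least every ~3 months

    def risk_assessment_due(effective_compute: float,
                            compute_at_last_eval: float,
                            last_eval_time: datetime,
                            now: datetime) -> bool:
        compute_trigger = effective_compute >= COMPUTE_FACTOR * compute_at_last_eval
        time_trigger = now - last_eval_time >= MAX_INTERVAL
        return compute_trigger or time_trigger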

Accountability

  • Publish regular updates on risk assessments: 33%. Updates are shared internally only.
  • Have good processes for revising policies: 61%. Changes to the RSP must be approved by the board and published before they are implemented. And Anthropic has a "non-compliance reporting policy."
  • Elicit external review of risk assessment practices: No.
Training

Make new systems safer, create them cautiously, and advance safer kinds of systems

Microsoft Training

Score: 9% · Weight: 14%
  • Filtering training data: No.
  • Scalable oversight: No.
  • Adversarial training: No.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: No.
DeepMind Training

Score: 38% · Weight: 14%
  • Filtering training data: No.
  • Scalable oversight: Yes. DeepMind has worked on this.
  • Adversarial training: 50%. For Gemini, they did red-teaming but don't discuss adversarial training.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: No.
Meta Training

Score: 16% · Weight: 14%
  • Filtering training data: No.
  • Scalable oversight: No.
  • Adversarial training: 50%. They use red-teaming, and they used adversarial training for Llama 2 but not Llama 3.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: No.
OpenAI Training

Score: 55% · Weight: 14%
  • Filtering training data: No.
  • Scalable oversight: Yes. OpenAI has worked on this.
  • Adversarial training: Yes.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: 50%. This is part of OpenAI's Preparedness Framework, which has not yet been implemented. The PF says OpenAI would pause when it reaches the "critical" risk level if it has not achieved an (underspecified) safety desideratum. However, the "critical" risk threshold is very high; a model below the threshold could cause a catastrophe. Additionally, OpenAI may be required to share the dangerous model with Microsoft, which has no analogous commitments.
Anthropic Training

Score: 58% · Weight: 14%
  • Filtering training data: No.
  • Scalable oversight: Yes. Anthropic has worked on this.
  • Adversarial training: 50%. Anthropic does red-teaming and has done research on adversarial training, but it is not clear whether it uses adversarial training when training its models.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: Yes. Anthropic commits to do risk assessment at least every 4x increase in effective training compute and to implement safety and security practices before crossing risk thresholds, pausing if necessary.
Scalable alignment

Understand and control systems the lab creates

Microsoft Scalable alignment

Score: 0% · Weight: 10%
  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.
DeepMind Scalable alignment

Score: 0% · Weight: 10%
  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.
Meta Scalable alignment

Score: 0% · Weight: 10%
  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.
OpenAI Scalable alignment

Score: 0% · Weight: 10%
  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.
Anthropic Scalable alignment

Score: 0% · Weight: 10%
  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.
Security*

Prevent model weights and research from being stolen

Microsoft Security*

Score: 6% · Weight: 9%

Certifications, audits, and pentests

  • Certification: 75%. Microsoft has security certifications but the reports are private.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: No.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: No.
DeepMind Security*

Score: 6% · Weight: 9%

Certifications, audits, and pentests

  • Certification: 75%. DeepMind has security certifications but the reports are private.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: No.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: No.
Meta Security*

Score: 0% · Weight: 9%

Certifications, audits, and pentests

  • Certification: No.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: No.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: No.
OpenAI Security*

Score: 10% · Weight: 9%

Certifications, audits, and pentests

  • Certification: 75%. OpenAI has security certifications but the reports are private.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: No.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: 25%. OpenAI's beta Preparedness Framework mentions improving security before reaching "high"-level risk, but it is not specific.
Anthropic Security*

Score: 21% · Weight: 9%

Certifications, audits, and pentests

  • Certification: 75%. Anthropic has security certifications but the reports are private.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: 50%. Anthropic says it is implementing this.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: 75%. Anthropic's Responsible Scaling Policy has security commitments, but they are insufficient.
Internal governance

Have internal structure and processes to promote safety and help make important decisions well

Microsoft Internal governance

Score: 0% · Weight: 8%

Organizational structure

  • Primary mandate is safety and benefit-sharing: No.
  • Good board: No.
  • Investors/shareholders have no formal power: No, they do have formal power.

Planning for pause

  • Detailed plan for pausing if necessary: No.

Leadership incentives

  • Executives have no financial interest: No, executives have equity.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: No.
  • Commit to not use non-disparagement agreements: No.
DeepMind Internal governance

Score: 0% · Weight: 8%

Organizational structure

  • Primary mandate is safety and benefit-sharing: No.
  • Good board: No.
  • Investors/shareholders have no formal power: No, they do have formal power.

Planning for pause

  • Detailed plan for pausing if necessary: No.

Leadership incentives

  • Executives have no financial interest: No, executives have equity.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: No.
  • Commit to not use non-disparagement agreements: No.
Meta Internal governance

Score: 0% · Weight: 8%

Organizational structure

  • Primary mandate is safety and benefit-sharing: No.
  • Good board: No.
  • Investors/shareholders have no formal power: No, they do have formal power.

Planning for pause

  • Detailed plan for pausing if necessary: No.

Leadership incentives

  • Executives have no financial interest: No, executives have equity.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: No.
  • Commit to not use non-disparagement agreements: No.
OpenAI Internal governance

Score: 58% · Weight: 8%
DESCRIPTION OUT OF DATE

Organizational structure

  • Primary mandate is safety and benefit-sharing: Yes. OpenAI is controlled by a nonprofit. OpenAI's "mission is to ensure that artificial general intelligence benefits all of humanity," and its Charter mentions "Broadly distributed benefits" and "Long-term safety."
  • Good board: 68%. OpenAI is controlled by a nonprofit board.
  • Investors/shareholders have no formal power: Yes, OpenAI is controlled by a nonprofit.

Planning for pause

  • Detailed plan for pausing if necessary: No. OpenAI's Preparedness Framework says "we recognize that pausing deployment or development would be the last resort (but potentially necessary) option." OpenAI has not elaborated on what a pause would entail.

Leadership incentives

  • Executives have no financial interest: 25%. Executives have equity, except CEO Sam Altman, who has a small amount of equity indirectly.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: 13%. Not yet, but OpenAI says it is "Creating a whistleblower hotline to serve as an anonymous reporting resource for all OpenAI employees and contractors."
  • Commit to not use non-disparagement agreements: No.
Anthropic Internal governance

Score: 71% · Weight: 8%
DESCRIPTION OUT OF DATE

Organizational structure

  • Primary mandate is safety and benefit-sharing: Yes. Anthropic is a public-benefit corporation, and its mission is to "responsibly develop and maintain advanced AI for the long-term benefit of humanity."
  • Good board: 35%. The board has a safety mandate and real power, but it is not independent.
  • Investors/shareholders have no formal power: No, they are represented by two board seats and can abrogate the Long-Term Benefit Trust.

Planning for pause

  • Detailed plan for pausing if necessary: 50%. Anthropic commits to be financially prepared for a pause for safety, but it has not said what its capabilities researchers would work on during a pause.

Leadership incentives

  • Executives have no financial interest: No, executives have equity.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: 25%. Anthropic has implemented a "non-compliance reporting policy" but has not yet shared details.
  • Commit to not use non-disparagement agreements: No.
Alignment program

Perform and share alignment research at all; have an alignment research team

Microsoft Alignment program

Score: 0% · Weight: 6%
Have an alignment research team and publish some alignment research: No.
DeepMind Alignment program

Score: 100% · Weight: 6%
Have an alignment research team and publish some alignment research: Yes.
Meta Alignment program

Score: 0% · Weight: 6%
Have an alignment research team and publish some alignment research: No.
OpenAI Alignment program

Score: 100% · Weight: 6%
Have an alignment research team and publish some alignment research: Yes.
Anthropic Alignment program

Score: 100% · Weight: 6%
Have an alignment research team and publish some alignment research: Yes.
Alignment plan

Make a plan for aligning powerful systems the lab creates

Microsoft Alignment plan

Score: 0% · Weight: 6%
Have a plan to deal with misalignment: No.
DeepMind Alignment plan

Score: 38% · Weight: 6%
Have a plan to deal with misalignment: 38%. The DeepMind alignment team basically has a plan, but DeepMind itself does not.
Meta Alignment plan

Score: 0% · Weight: 6%
Have a plan to deal with misalignment: No.
OpenAI Alignment plan

Score: 0% · Weight: 6%
Have a plan to deal with misalignment: No. OpenAI used to plan to use scalable oversight to automate alignment research, but since the Superalignment team was dissolved it does not seem to have a plan.
Anthropic Alignment plan

Score: 75% · Weight: 6%
Have a plan to deal with misalignment: 75%. Anthropic has shared how it thinks about alignment and its portfolio approach.
Public statements

Be aware of AI risk, that AI safety might be really hard, and that risks might be hard to notice

Microsoft Public statements

Score: 0% · Weight: 3%
  • Talk about and understand extreme risks from AI: No. Microsoft and its CEO seem to never talk about extreme risks or the alignment problem.
  • Clearly describe a worst-case plausible outcome from AI and state the lab's credence in such an outcome: No.
DeepMind Public statements

Score: 20% · Weight: 3%
  • Talk about and understand extreme risks from AI: 25%. DeepMind leadership sometimes talk about extreme risks and the alignment problem, but DeepMind as an organization rarely does, and Google and its leadership don't.
  • Clearly describe a worst-case plausible outcome from AI and state the lab's credence in such an outcome: No.
Meta Public statements

Score: 0% · Weight: 3%
  • Talk about and understand extreme risks from AI: No. Meta AI and its leadership seem to disbelieve in extreme risks and the alignment problem.
  • Clearly describe a worst-case plausible outcome from AI and state the lab's credence in such an outcome: No.
OpenAI Public statements

Score: 40% · Weight: 3%
  • Talk about and understand extreme risks from AI: 50%. OpenAI and its leadership sometimes talk about extreme risks and the alignment problem.
  • Clearly describe a worst-case plausible outcome from AI and state the lab's credence in such an outcome: No.
Anthropic Public statements

Score: 60% · Weight: 3%
  • Talk about and understand extreme risks from AI: 75%. Anthropic and its leadership often talk about extreme risks and the alignment problem.
  • Clearly describe a worst-case plausible outcome from AI and state the lab's credence in such an outcome: No.
Summary of scores by category (Microsoft / DeepMind / Meta / OpenAI / Anthropic; category weights sum to 100%):

  • Deployment (24% weight): 3% / 24% / 4% / 26% / 61%
  • Risk assessment (20% weight): 1% / 37% / 8% / 45% / 47%
  • Training (14% weight): 9% / 38% / 16% / 55% / 58%
  • Scalable alignment (10% weight): 0% / 0% / 0% / 0% / 0%
  • Security* (9% weight): 6% / 6% / 0% / 10% / 21%
  • Internal governance (8% weight): 0% / 0% / 0% / 58% / 71%
  • Alignment program (6% weight): 0% / 100% / 0% / 100% / 100%
  • Alignment plan (6% weight): 0% / 38% / 0% / 0% / 75%
  • Public statements (3% weight): 0% / 20% / 0% / 40% / 60%
  • Overall: Meta 5%, OpenAI 36%, Anthropic 52%
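These overall figures are consistent with a weighted average of the category scores (our reading; the sketch below is illustrative, with the weights and Anthropic's category scores copied from the tables above):

    # Recompute an overall score as a weighted average of category scores.
    # Weights and Anthropic's scores are copied from the scorecard above.
    weights = {
        "Deployment": 0.24, "Risk assessment": 0.20, "Training": 0.14,
        "Scalable alignment": 0.10, "Security": 0.09, "Internal governance": 0.08,
        "Alignment program": 0.06, "Alignment plan": 0.06, "Public statements": 0.03,
    }
    anthropic_scores = {
        "Deployment": 61, "Risk assessment": 47, "Training": 58,
        "Scalable alignment": 0, "Security": 21, "Internal governance": 71,
        "Alignment program": 100, "Alignment plan": 75, "Public statements": 60,
    }
    overall = sum(weights[c] * anthropic_scores[c] for c in weights)
    print(round(overall))  # 52, matching Anthropic's overall score above

The same arithmetic appears to reproduce Meta's 5% and OpenAI's 36%.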

This scorecard is mostly up to date as of 1 November 2024. The scores are mostly current, but the site contains many references to old models, policies, and facts.

*We believe that the Security scores are very poor indicators of labs’ security. Security is hard to evaluate from outside an organization, and few organizations say much about their security. But we believe that each security criterion corresponds to a good ask for labs. If you have suggestions for better criteria, can point us to sources showing that our scoring is wrong, or want to convince us that some of our criteria should be removed, please let us know.