We collect actions for frontier AI labs to take to avert extreme risks from AI, then evaluate particular labs accordingly.

These criteria are not exhaustive or sufficient: they’re the best we have given what we currently know, given that they must be concrete, and given that they must be evaluated from public information. Some important variables are hard to measure, especially from outside the labs.

This website is a low-key research preview. We’re sharing it to elicit feedback and help determine which direction to take it. Although it’s in beta, we endorse the content (modulo a few criteria where it’s hard to find sources) and you are welcome to share and discuss it publicly. Leave feedback here.

Deployment (weight: 24%)

Avoid dangers in deployment, and boost safety but avoid boosting dangerous research

Microsoft Deployment (score: 3%)

Releasing models well

  • Deploy based on risk assessment results: No.
  • Structured access: No. Microsoft releases its model weights, and it has not published a capability/risk threshold past which it would stop releasing weights, nor any other plan for noticing whether releasing is dangerous.
  • Staged release: 25%.

Keeping capabilities research private

  • Policy against publishing capabilities research: No.
  • Keep research behind the lab's language models private: 8%.
  • Keep other capabilities research private: No.

Deployment protocol

  • Safety scaffolding: 1%. Microsoft uses content filtering, but not to avert misuse; it does not use other safety scaffolding techniques; and it releases its model weights anyway.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.

DeepMind Deployment (score: 18%)

Releasing models well

  • Deploy based on risk assessment results: No.
  • Structured access: 54%. DeepMind deploys its most powerful models via API.
  • Staged release: Yes.

Keeping capabilities research private

  • Policy against publishing capabilities research: No.
  • Keep research behind the lab's language models private: 25%. DeepMind publishes most of this research, including on its Gemini models.
  • Keep other capabilities research private: No.

Deployment protocol

  • Safety scaffolding: 4%. DeepMind uses filters but not to avert misuse, and it does not use other safety scaffolding techniques.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.

Meta Deployment (score: 4%)

Releasing models well

  • Deploy based on risk assessment results: No.
  • Structured access: No. Meta AI releases its model weights, and it has not published a capability/risk threshold past which it would stop releasing weights, nor any other plan for noticing whether releasing is dangerous.
  • Staged release: No.

Keeping capabilities research private

  • Policy against publishing capabilities research: No.
  • Keep research behind the lab's language models private: 25%. Meta AI publishes most of this research, but omits details about training data.
  • Keep other capabilities research private: No.

Deployment protocol

  • Safety scaffolding: 1%. Meta uses filtering in its chatbot, Meta AI, but not to avert misuse; it does not use other safety scaffolding techniques; and it releases its model weights anyway.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.

OpenAI Deployment (score: 26%)

Releasing models well

  • Deploy based on risk assessment results: No. OpenAI shares its models with Microsoft, which has made no such commitments. It would otherwise score 33%: the Preparedness Framework says OpenAI won't externally deploy a model with a post-mitigation risk score of "High," but internal deployment is treated the same as development, with a very high capability threshold; the framework makes no commitments about risk assessment during deployment or about what would prompt OpenAI to modify or undeploy a deployed model; and OpenAI does not commit to particular safety practices (or control evaluations).
  • Structured access: 54%. OpenAI deploys its models via API.
  • Staged release: Yes.

Keeping capabilities research private

  • Policy against publishing capabilities research: No.
  • Keep research behind the lab's language models private: Yes.
  • Keep other capabilities research private: Yes.

Deployment protocol

  • Safety scaffolding: 0%.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.

Anthropic Deployment (score: 61%)

Releasing models well

  • Deploy based on risk assessment results: 92%. Risk assessment results affect deployment decisions per Anthropic's Responsible Scaling Policy. The RSP also mentions "de-deployment" if Anthropic realizes that a model is much more capable than it believed, but it does not give much detail or make firm commitments. And the RSP currently contains a "drafting error" suggesting that risk assessment does not occur during deployment.
  • Structured access: 54%. Anthropic deploys its models via API.
  • Staged release: Yes.

Keeping capabilities research private

  • Policy against publishing capabilities research: 67%. They have a policy but haven't shared the details.
  • Keep research behind the lab's language models private: Yes.
  • Keep other capabilities research private: Yes.

Deployment protocol

  • Safety scaffolding: 28%. Anthropic uses filtering and has made commitments to use safety scaffolding techniques in the future.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.

Risk assessment (weight: 20%)

Predict models' dangerous capabilities and check for warning signs for those capabilities

Microsoft Risk assessment (score: 1%)

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 3%. Microsoft says "red team testing will include testing of dangerous capabilities, including related to biosecurity and cybersecurity," but it doesn't share details, have model evals, or test for autonomous replication or situational awareness capabilities.
  • Share how the lab does model evals and red-teaming: No.
  • Prepare to make "control arguments" for powerful models: No.
  • Give third parties access to do model evals: No.

Commitments

  • Commit to use risk assessment frequently enough: No.

Accountability

  • Publish regular updates on risk assessments: No.
  • Have good processes for revising policies: No.
  • Elicit external review of risk assessment practices: No.

DeepMind Risk assessment (score: 29%)

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 50%. DeepMind has good model evals and shares relevant details. However, it doesn't necessarily use these evals before deployment.
  • Share how the lab does model evals and red-teaming: Yes.
  • Prepare to make "control arguments" for powerful models: No.
  • Give third parties access to do model evals: 25%. DeepMind shared Gemini Ultra with external groups before deployment, but it didn't give them deep access. DeepMind has not shared results of this testing.

Commitments

  • Commit to use risk assessment frequently enough: No.

Accountability

  • Publish regular updates on risk assessments: No.
  • Have good processes for revising policies: No.
  • Elicit external review of risk assessment practices: No.

Meta Risk assessment (score: 12%)

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 53%. Meta AI tested Llama 3 for some hacking capabilities and open-sourced this evaluation.
  • Share how the lab does model evals and red-teaming: No.
  • Prepare to make "control arguments" for powerful models: No.
  • Give third parties access to do model evals: No.

Commitments

  • Commit to use risk assessment frequently enough: No.

Accountability

  • Publish regular updates on risk assessments: No.
  • Have good processes for revising policies: No.
  • Elicit external review of risk assessment practices: No.

OpenAI Risk assessment (score: 25%)

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 32%. OpenAI's not-yet-implemented Preparedness Framework will involve model evals for model autonomy, cybersecurity, CBRN, and persuasion.
  • Share how the lab does model evals and red-teaming: No.
  • Prepare to make "control arguments" for powerful models: No.
  • Give third parties access to do model evals: 25%. The Preparedness Framework calls for some external review, but not necessarily independent model evals.

Commitments

  • Commit to use risk assessment frequently enough: 50%. The Preparedness Framework says "We will be running these evaluations continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training. This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough." But we have no idea how they can run evaluations before training. More importantly, it's not clear whether "after training" includes during deployment. Additionally, the Preparedness Framework has not yet been implemented.

Accountability

  • Publish regular updates on risk assessments: 17%. Once the Preparedness Framework is implemented, there will be a monthly internal report and a public scorecard with very high-level results.
  • Have good processes for revising policies: 43%. The OpenAI board reviews and can veto changes.
  • Elicit external review of risk assessment practices: No.

Anthropic Risk assessment (score: 41%)

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 65%. Anthropic does model evals for autonomous replication and adaptation, cyber capabilities, and biology capabilities.
  • Share how the lab does model evals and red-teaming: 25%.
  • Prepare to make "control arguments" for powerful models: No.
  • Give third parties access to do model evals: No.

Commitments

  • Commit to use risk assessment frequently enough: Yes. Anthropic commits to do risk assessment at every 4x increase in effective training compute and every 3 months.

Accountability

  • Publish regular updates on risk assessments: 33%. Anthropic shares these updates internally but not publicly.
  • Have good processes for revising policies: 39%. Changes to the RSP must be approved by the board and published before they are implemented. And Anthropic has implemented a "non-compliance reporting policy" and will share details soon.
  • Elicit external review of risk assessment practices: No.

Training (weight: 14%)

Make new systems safer, create them cautiously, and advance safer kinds of systems

Microsoft Training (score: 9%)

  • Filtering training data: No.
  • Scalable oversight: No.
  • Adversarial training: No.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: No.

DeepMind Training (score: 38%)

  • Filtering training data: No.
  • Scalable oversight: Yes. DeepMind has worked on this.
  • Adversarial training: 50%. For Gemini, they did red-teaming but don't discuss adversarial training.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: No.

Meta Training (score: 16%)

  • Filtering training data: No.
  • Scalable oversight: No.
  • Adversarial training: 50%. They use red-teaming, and they used adversarial training for Llama 2 but not Llama 3.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: No.

OpenAI Training (score: 55%)

  • Filtering training data: No.
  • Scalable oversight: Yes. OpenAI has worked on this.
  • Adversarial training: Yes.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: 50%. This is part of OpenAI's Preparedness Framework, which has not yet been implemented. The PF says OpenAI would pause as it reaches the "critical" risk level if it has not achieved an (underspecified) safety desideratum. However, the "critical" risk threshold is very high; a model below the threshold could cause a catastrophe. Additionally, OpenAI may be required to share the dangerous model with Microsoft, which has no analogous commitments.

Anthropic Training (score: 58%)

  • Filtering training data: No.
  • Scalable oversight: Yes. Anthropic has worked on this.
  • Adversarial training: 50%. Anthropic does red-teaming and has done research on adversarial training, but it's not clear whether it uses adversarial training.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: Yes. Anthropic commits to do risk assessment at least every 4x increase in effective training compute and to implement safety and security practices before crossing risk thresholds, pausing if necessary.

Scalable alignment (weight: 10%)

Understand and control systems the lab creates

Microsoft Scalable alignment (score: 0%)

  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.

DeepMind Scalable alignment (score: 0%)

  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.

Meta Scalable alignment (score: 0%)

  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.

OpenAI Scalable alignment (score: 0%)

  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.

Anthropic Scalable alignment (score: 0%)

  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.

Security* (weight: 9%)

Prevent model weights and research from being stolen

Microsoft Security* (score: 6%)

Certifications, audits, and pentests

  • Certification: 75%. Microsoft has security certifications but the reports are private.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: No.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: No.

DeepMind Security* (score: 6%)

Certifications, audits, and pentests

  • Certification: 75%. DeepMind has security certifications but the reports are private.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: No.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: No.

Meta Security* (score: 0%)

Certifications, audits, and pentests

  • Certification: No.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: No.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: No.

OpenAI Security* (score: 10%)

Certifications, audits, and pentests

  • Certification: 75%. OpenAI has security certifications but the reports are private.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: No.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: 25%. OpenAI's beta Preparedness Framework mentions improving security before reaching "high"-level risk, but it is not specific.

Anthropic Security* (score: 21%)

Certifications, audits, and pentests

  • Certification: 75%. Anthropic has security certifications but the reports are private.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: 50%. Anthropic says it is implementing this.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: 75%. Anthropic's Responsible Scaling Policy has security commitments, but they are insufficient.

Internal governance (weight: 8%)

Have internal structure and processes to promote safety and help make important decisions well

Microsoft Internal governance (score: 0%)

Organizational structure

  • Primary mandate is safety and benefit-sharing: No.
  • Good board: No.
  • Investors/shareholders have no formal power: No, they do have formal power.

Planning for pause

  • Detailed plan for pausing if necessary: No.

Leadership incentives

  • Executives have no financial interest: No, executives have equity.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: No.
  • Commit to not use non-disparagement agreements: No.

DeepMind Internal governance (score: 0%)

Organizational structure

  • Primary mandate is safety and benefit-sharing: No.
  • Good board: No.
  • Investors/shareholders have no formal power: No, they do have formal power.

Planning for pause

  • Detailed plan for pausing if necessary: No.

Leadership incentives

  • Executives have no financial interest: No, executives have equity.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: No.
  • Commit to not use non-disparagement agreements: No.

Meta Internal governance (score: 0%)

Organizational structure

  • Primary mandate is safety and benefit-sharing: No.
  • Good board: No.
  • Investors/shareholders have no formal power: No, they do have formal power.

Planning for pause

  • Detailed plan for pausing if necessary: No.

Leadership incentives

  • Executives have no financial interest: No, executives have equity.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: No.
  • Commit to not use non-disparagement agreements: No.

OpenAI Internal governance (score: 33%)

Organizational structure

  • Primary mandate is safety and benefit-sharing: Yes. OpenAI is controlled by a nonprofit. OpenAI's "mission is to ensure that artificial general intelligence benefits all of humanity," and its Charter mentions "Broadly distributed benefits" and "Long-term safety."
  • Good board: 68%. OpenAI is controlled by a nonprofit board.
  • Investors/shareholders have no formal power: Yes, OpenAI is controlled by a nonprofit.

Planning for pause

  • Detailed plan for pausing if necessary: No. OpenAI's Preparedness Framework says "we recognize that pausing deployment or development would be the last resort (but potentially necessary) option." OpenAI has not elaborated on what a pause would entail.

Leadership incentives

  • Executives have no financial interest: 25%. Executives have equity, except CEO Sam Altman, who has a small amount of equity indirectly.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: 13%. Not yet, but OpenAI says it is "Creating a whistleblower hotline to serve as an anonymous reporting resource for all OpenAI employees and contractors."
  • Commit to not use non-disparagement agreements: No.

Anthropic Internal governance (score: 33%)

Organizational structure

  • Primary mandate is safety and benefit-sharing: Yes. Anthropic is a public-benefit corporation, and its mission is to "responsibly develop and maintain advanced AI for the long-term benefit of humanity."
  • Good board: 35%. The board has a safety mandate and real power, but it is not independent.
  • Investors/shareholders have no formal power: No, they are represented by two board seats and can abrogate the Long-Term Benefit Trust.

Planning for pause

  • Detailed plan for pausing if necessary: 50%. Anthropic commits to be financially prepared for a pause for safety, but it has not said what its capabilities researchers would work on during a pause.

Leadership incentives

  • Executives have no financial interest: No, executives have equity.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: 25%. Anthropic has implemented a "non-compliance reporting policy" but has not yet shared details.
  • Commit to not use non-disparagement agreements: No.

Alignment program (weight: 6%)

Perform and share alignment research at all; have an alignment research team

Microsoft Alignment program (score: 0%)

Have an alignment research team and publish some alignment research: No.

DeepMind Alignment program (score: 100%)

Have an alignment research team and publish some alignment research: Yes.

Meta Alignment program (score: 0%)

Have an alignment research team and publish some alignment research: No.

OpenAI Alignment program (score: 100%)

Have an alignment research team and publish some alignment research: Yes.

Anthropic Alignment program (score: 100%)

Have an alignment research team and publish some alignment research: Yes.

Alignment plan (weight: 6%)

Make a plan for aligning powerful systems the lab creates

Microsoft Alignment plan (score: 0%)

Have a plan to deal with misalignment: No.

DeepMind Alignment plan (score: 38%)

Have a plan to deal with misalignment: 38%. The DeepMind alignment team basically has a plan, but DeepMind itself does not.

Meta Alignment plan (score: 0%)

Have a plan to deal with misalignment: No.

OpenAI Alignment plan (score: 75%)

Have a plan to deal with misalignment: 75%. OpenAI plans to use scalable oversight to automate alignment research.

Anthropic Alignment plan (score: 75%)

Have a plan to deal with misalignment: 75%. Anthropic has shared how it thinks about alignment and its portfolio approach.

Public statements (weight: 3%)

Be aware of AI risk, that AI safety might be really hard, and that risks might be hard to notice

Microsoft Public statements (score: 0%)

  • Talk about and understand extreme risks from AI: No. Microsoft and its CEO seem to never talk about extreme risks or the alignment problem.
  • Clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it: No.

DeepMind Public statements (score: 20%)

  • Talk about and understand extreme risks from AI: 25%. DeepMind leadership sometimes talk about extreme risks and the alignment problem, but DeepMind as an organization rarely does, and Google and its leadership don't.
  • Clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it: No.

Meta Public statements (score: 0%)

  • Talk about and understand extreme risks from AI: No. Meta AI and its leadership seem to disbelieve in extreme risks and the alignment problem.
  • Clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it: No.

OpenAI Public statements (score: 40%)

  • Talk about and understand extreme risks from AI: 50%. OpenAI and its leadership sometimes talk about extreme risks and the alignment problem.
  • Clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it: No.

Anthropic Public statements (score: 60%)

  • Talk about and understand extreme risks from AI: 75%. Anthropic and its leadership often talk about extreme risks and the alignment problem.
  • Clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it: No.

Overview of scores (category weights sum to 100%)

Category (weight)            Microsoft   DeepMind   Meta   OpenAI   Anthropic
Deployment (24%)                    3%        18%     4%      26%         61%
Risk assessment (20%)               1%        29%    12%      25%         41%
Training (14%)                      9%        38%    16%      55%         58%
Scalable alignment (10%)            0%         0%     0%       0%          0%
Security* (9%)                      6%         6%     0%      10%         21%
Internal governance (8%)            0%         0%     0%      33%         33%
Alignment program (6%)              0%       100%     0%     100%        100%
Alignment plan (6%)                 0%        38%     0%      75%         75%
Public statements (3%)              0%        20%     0%      40%         60%
Overall (weighted total)            3%        25%     6%      34%         48%
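
Each lab's overall score is (approximately) the weighted sum of its category scores, rounded to a whole percentage. Below is a minimal Python sketch of that arithmetic, using the weights and Anthropic's category scores listed above; the code and its names are illustrative only, not our actual scoring implementation.

    # Illustrative sketch: compute an overall score as the weighted sum of
    # category scores. Weights are the category weights listed above.
    WEIGHTS = {
        "Deployment": 0.24,
        "Risk assessment": 0.20,
        "Training": 0.14,
        "Scalable alignment": 0.10,
        "Security": 0.09,
        "Internal governance": 0.08,
        "Alignment program": 0.06,
        "Alignment plan": 0.06,
        "Public statements": 0.03,
    }

    def overall_score(category_scores: dict[str, float]) -> float:
        """Weighted sum of category scores, each given in percent."""
        return sum(WEIGHTS[category] * score for category, score in category_scores.items())

    # Example: Anthropic's category scores as listed above.
    anthropic = {
        "Deployment": 61, "Risk assessment": 41, "Training": 58,
        "Scalable alignment": 0, "Security": 21, "Internal governance": 33,
        "Alignment program": 100, "Alignment plan": 75, "Public statements": 60,
    }
    print(round(overall_score(anthropic)))  # 48, matching the overall score above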

This site is up to date as of 1 May 2024. It does not yet include recent events related to OpenAI (Leike departure, Superalignment dissolution) or DeepMind (Frontier Safety Framework).

*We believe that the Security scores are very poor indicators about labs’ security. Security is hard to evaluate from outside the organization; few organizations say much about their security. But we believe that each security criterion corresponds to a good ask for labs. If you have suggestions for better criteria—or can point us to sources showing that our scoring is wrong, or want to convince us that some of our criteria should be removed—please let us know.