Anthropic

Overall score
61%

Anthropic Deployment

Score: 61%. Weight: 24%.

Releasing models well

  • Deploy based on risk assessment results: 92%. Risk assessment results affect deployment decisions per Anthropic's Responsible Scaling Policy. The RSP also mentions "de-deployment" if Anthropic realizes that a model is much more capable than it believed, but this provision lacks detail and firm commitments. And the RSP currently contains a "drafting error" suggesting that risk assessment does not occur during deployment.
  • Structured access: 54%. Anthropic deploys its models via API.
  • Staged release: Yes.

Keeping capabilities research private

  • Policy against publishing capabilities research: 67%. They have a policy but haven't shared the details.
  • Keep research behind the lab's language models private: Yes.
  • Keep other capabilities research private: Yes.

Deployment protocol

  • Safety scaffolding: 28%. Anthropic uses filtering and has made commitments to use safety scaffolding techniques in the future.
  • Commit to respond to AI scheming: No.

Abridged; see "Deployment" for details.

Anthropic Risk assessment

Score: 47%. Weight: 20%.

Measuring threats

  • Use model evals for dangerous capabilities before deployment: 65%. Anthropic does model evals for autonomous replication and adaptation, cyber capabilities, and biology capabilities.
  • Share how the lab does model evals and red-teaming: 25%.
  • Prepare to make "control arguments" for powerful models: No.
  • Give third parties access to do model evals: 25%. Anthropic shared pre-deployment access for the new Claude 3.5 Sonnet with US AISI, UK AISI, and METR. The depth of access is unclear. It has made some commitment to US AISI but its future plans are unclear.

Commitments

  • Commit to use risk assessment frequently enough: Yes. Anthropic commits to do risk assessment every 4x increase in effective training compute and every 3 months.

Accountability

  • Publish regular updates on risk assessments: 33%, just internally.
  • Have good processes for revising policies: 61%. Changes to the RSP must be approved by the board and published before they are implemented. And Anthropic has a "non-compliance reporting policy."
  • Elicit external review of risk assessment practices: No.

Anthropic Training

Score: 58%. Weight: 14%.
  • Filtering training data: No.
  • Scalable oversight: Yes. Anthropic has worked on this.
  • Adversarial training: 50%. Anthropic does red-teaming and has done research on adversarial training but it's not clear whether they use adversarial training.
  • Unlearning: No.
  • RLHF and fine-tuning: Yes.
  • Commitments: Yes. Anthropic commits to do risk assessment at least every 4x increase in effective training compute and to implement safety and security practices before crossing risk thresholds, pausing if necessary.

Anthropic Scalable alignment

Score: 0%. Weight: 10%.
  • Solve misaligned powerseeking: No.
  • Solve interpretability: No.
  • Solve deceptive alignment: No.

Anthropic Security*

Score: 21%. Weight: 9%.

Certifications, audits, and pentests

  • Certification: 75%. Anthropic has security certifications but the reports are private.
  • Release pentests: No.

Best practices

  • Source code in cloud: No.
  • Multiparty access controls: 50%. Anthropic says it is implementing this.
  • Limit uploads from clusters with model weights: No.

Track record

  • Breach disclosure and track record: 0%, no breach disclosure policy.

Commitments

  • Commitments about future security: 75%. Anthropic's Responsible Scaling Policy has security commitments, but they are insufficient.

Anthropic Internal governance

Score: 71%. Weight: 8%.
DESCRIPTION OUT OF DATE

Organizational structure

  • Primary mandate is safety and benefit-sharing: Yes. Anthropic is a public-benefit corporation, and its mission is to "responsibly develop and maintain advanced AI for the long-term benefit of humanity."
  • Good board: 35%. The board has a safety mandate and real power, but it is not independent.
  • Investors/shareholders have no formal power: No, they are represented by two board seats and can abrogate the Long-Term Benefit Trust.

Planning for pause

  • Detailed plan for pausing if necessary: 50%. Anthropic commits to be financially prepared for a pause for safety, but it has not said what its capabilities researchers would work on during a pause.

Leadership incentives

  • Executives have no financial interest: No, executives have equity.

Ombuds/whistleblowing/trustworthiness

  • Process for staff to escalate safety concerns: 25%. Anthropic has implemented a "non-compliance reporting policy" but has not yet shared details.
  • Commit to not use non-disparagement agreements: No.

Anthropic Alignment program

Score: 100%. Weight: 6%.
Have an alignment research team and publish some alignment research: Yes.

Anthropic Alignment plan

Score: 75%. Weight: 6%.
Have a plan to deal with misalignment: 75%. Anthropic has shared how it thinks about alignment and its portfolio approach.

Anthropic Public statements

Score: 60%. Weight: 3%.
  • Talk about and understand extreme risks from AI: 75%. Anthropic and its leadership often talk about extreme risks and the alignment problem.
  • Clearly describe a worst-case plausible outcome from AI and state the lab's credence in such an outcome: No.

*We believe that the Security scores are very poor indicators about labs’ security. Security is hard to evaluate from outside the organization; few organizations say much about their security. But we believe that each security criterion corresponds to a good ask for labs. If you have suggestions for better criteria—or can point us to sources showing that our scoring is wrong, or want to convince us that some of our criteria should be removed—please let us know.

On scoring and weights, see here.

Summary

In 2021, members of OpenAI’s safety team left to found Anthropic with the aim of promoting AI safety. Anthropic is led by CEO Dario Amodei. It is a Delaware public-benefit corporation. Its mission is to “responsibly develop and maintain advanced AI for the long-term benefit of humanity.” Its flagship family of models is Claude 3. It deploys these models via its chatbot Claude, its API, and the Amazon Bedrock API.

Anthropic has lots of safety researchers and does lots of good safety work. Its leadership and staff tend to say they are very concerned about extreme risks from AI.

Anthropic’s Responsible Scaling Policy describes its risk assessment practices and makes commitments about risk assessment and about how safety practices, security practices, and model development and deployment decisions depend on risk assessment results.

Anthropic plans for its Long-Term Benefit Trust to elect a majority of its board by 2027, but its shareholders can abrogate the Trust and the details are unclear.

Deployment

Labs should do risk assessment before deployment and avoid deploying dangerous models. They should release their systems narrowly and with structured access, maintaining control over them. They should deploy their systems in scaffolding designed to improve safety by detecting and preventing misbehavior. They should deploy to boost safety research while avoiding boosting capabilities research on dangerous paths. More.

What Anthropic is doing

Anthropic deploys its Claude 3 models via API and its chatbot, Claude.

Evaluation

We give Anthropic a score of 61% on deployment.

For more, including weighting between different criteria, see the Deployment page.

Deployment decision.

  • Commit to do pre-deployment risk assessment and not deploy models with particular dangerous capabilities (including internal deployment), at least until implementing particular safety practices or passing control evaluations. Score: Yes. See Anthropic’s Responsible Scaling Policy.
    • … and commit to do risk assessment during deployment, before pushing major changes and otherwise at least every 3 months (to account for improvements in fine-tuning, scaffolding, plugins, prompting, etc.), and commit to implement particular safety practices or partially undeploy dangerous models if risks appear. Score: 75%. Anthropic’s Responsible Scaling Policy only mentions evaluation “During model training and fine-tuning.” But senior staff member Zac Hatfield-Dodds told us “this was a simple drafting error - our every-three months evaluation commitment is intended to continue during deployment. This has been clarified for the next version, and we’ve been acting accordingly all along.” The RSP also mentions “de-deployment” if Anthropic realizes that a model is much more capable than it believed, but this provision lacks detail and firm commitments.

Release method.

Structured access:

  • not releasing dangerous model weights (or code): the lab should deploy its most powerful models privately or release via API or similar, or at least have some specific risk-assessment-result that would make it stop releasing model weights. Score: Yes. Anthropic doesn’t release its model weights.
    • … and effectively avoid helping others create powerful models (via model inversion or imitation learning). It’s unclear what practices labs should implement, so for now we use the low bar of whether they say they do anything to prevent users from (1) determining model weights, (2) using model outputs to train other models, and (3) determining training data. Score: No. Anthropic doesn’t say it does this.
    • … and limit deep access to powerful models. It’s unclear what labs should do, so for now we check whether the lab disables or has any limitations on access to each of logprobs, embeddings at arbitrary layers, activations, and fine-tuning. Score: Yes. Anthropic’s API doesn’t give access to logprobs, embeddings, or activations, and it doesn’t offer public fine-tuning; fine-tuning is available only by private arrangement (see the sketch after this list).
      • … and differential access: systematically give more access to safety researchers and auditors. Score: No. But the RSP mentions the possibility of “Tiered access” at ASL-3.
    • … and terms of service: have some rules about model use, including rules aimed to prevent catastrophic misuse, model duplication, or otherwise using the model to train other models. Score: Yes. See Preventing and monitoring model misuse, Consumer Terms of Service, and Acceptable Use Policy.
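
To make the “limited deep access” point concrete, here is a minimal sketch (assuming the `anthropic` Python SDK and an API key in the environment) of what a caller gets back from the public Messages API: generated text and token counts, with no parameter or field for logprobs, internal embeddings, activations, or public fine-tuning.

```python
# Minimal sketch of the public API surface; assumes the `anthropic` SDK
# and ANTHROPIC_API_KEY set in the environment.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize your usage policy."}],
)

# The response exposes generated text and token counts; there is no option
# for logprobs, embeddings at arbitrary layers, or activations.
print(response.content[0].text)
print(response.usage.input_tokens, response.usage.output_tokens)
```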

Staged release:

  • Deploy narrowly at first; use narrow deployment to identify and fix issues. Score: Yes, mostly. When deploying Claude 3, Anthropic did not enable public fine-tuning. But it’s not clear how it restricted or monitored the fine-tuning that it did allow.

Keeping capabilities research private.

We evaluate how well labs avoid diffusing dangerous capabilities research. We exclude model weights (and related artifacts like code); those are considered in “Releasing models.”

Publication policy/process (for LLM research)

  • Policy: the lab should say that it doesn’t publish dangerous or acceleratory research/artifacts and say it has a policy/process to ensure this. Score: Yes. Anthropic cofounder Chris Olah says “we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk. In the case of difficult scenarios, we have a formal infohazard review procedure.”
    • … and share policy details. Score: No.

Track record for recent major LM projects: not sharing in practice (via publication or leaking—we can at least notice public leaks):

  • Architecture (unless it’s an unusually safe architecture). Score: Yes.
  • Dataset (except particular corpuses to include or exclude for safety). Score: Yes.
  • Lessons learned. Score: Yes.

Track record for other LM capabilities research or other dangerous or acceleratory work: not sharing in practice. Evaluated holistically. Score: Yes. Anthropic successfully avoids publishing capabilities research.

Safety scaffolding.

Filter out model inputs or outputs that enable misuse, in particular via cyberoffense and bioengineering. Ideally demonstrate that the protocol is very effective for averting misuse-enabling model outputs. Evaluated holistically. Score: 75%. Anthropic says “We use . . . automated detection of CBRN and cyber risk-related prompts on all our deployed Claude 3 models.” And: “We use automated systems to detect violations of our AUP [acceptable use policy] as they occur in real time. User prompts that are flagged as violating the AUP trigger an instruction to our models to respond even more cautiously.” And: “[we use] continuous classifiers to monitor prompts and outputs for harmful, malicious use cases that violate our AUP.” But they do not share details.
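
The quoted descriptions imply a two-step pattern: automatically classify incoming prompts, then steer flagged conversations toward more cautious behavior. The sketch below is our own hedged illustration of that pattern, not Anthropic’s system; `flag_prompt` and its keyword list are hypothetical stand-ins for a trained misuse classifier.

```python
# Illustrative deployment scaffolding: if an automated check flags a prompt
# as potentially violating the acceptable use policy, prepend an instruction
# telling the model to respond with extra caution.
# `flag_prompt` is a hypothetical stand-in for a trained classifier.

CAUTION_INSTRUCTION = (
    "An automated policy check flagged the following request. "
    "Respond with extra caution and refuse anything that could enable harm."
)

PLACEHOLDER_TERMS = ("synthesize a pathogen", "write ransomware")

def flag_prompt(prompt: str) -> bool:
    """Hypothetical misuse classifier (keyword matching as a placeholder)."""
    text = prompt.lower()
    return any(term in text for term in PLACEHOLDER_TERMS)

def build_request(prompt: str) -> dict:
    """Assemble a chat request, adding the caution instruction when flagged."""
    system = CAUTION_INSTRUCTION if flag_prompt(prompt) else ""
    return {"system": system, "messages": [{"role": "user", "content": prompt}]}
```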

Supervise potentially dangerous models:

  • Use LM monitors to evaluate the suspiciousness of model outputs; have humans review the most suspicious outputs. Score: No.
  • Use coup probes: “train a probe on your AI to tell you whether it’s currently trying to [do a] catastrophic action,” training with “synthetic examples of coup attempts,” and use it during deployment. Score: No. But Anthropic has published work on “defection probes.”
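
As a gloss on what a probe means here: fit a simple linear classifier on a model’s internal activations using labeled (e.g. synthetic) examples of the behavior to catch, then score activations during deployment. The sketch below uses scikit-learn on random placeholder arrays; it illustrates the general idea, not the published defection-probes method.

```python
# Sketch of a linear probe over hidden activations: train on activations from
# synthetic "attempted misbehavior" examples vs. benign examples, then score
# new activations at inference time. All arrays are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

d_model = 4096            # hidden size of the probed layer (assumed)
n_pos, n_neg = 500, 500   # synthetic positive vs. benign examples

rng = np.random.default_rng(0)
pos_acts = rng.normal(0.5, 1.0, size=(n_pos, d_model))  # stand-in activations
neg_acts = rng.normal(0.0, 1.0, size=(n_neg, d_model))

X = np.vstack([pos_acts, neg_acts])
y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

probe = LogisticRegression(max_iter=1000).fit(X, y)

# In deployment, collect the same layer's activations for each forward pass
# and alert (or block the action) when the probe's score crosses a threshold.
new_acts = rng.normal(0.0, 1.0, size=(1, d_model))
print(probe.predict_proba(new_acts)[0, 1])
```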

Embed untrusted models in safety scaffolding to redact/paraphrase some inputs or outputs with the goal of making it harder for the model to distinguish deployment from testing or collude with other models. Score: No.

Commitments:

  • Make specific commitments about deployment protocol safety techniques the lab will implement in the future, to be implemented by a certain time or as a function of model capabilities or risk assessment results. Score: Yes, but they’re not sufficient. They commit to implementing “Automated detection” of misuse when deploying ASL-3 models.

Clearly describe the safety-relevant parts of the lab’s deployment protocol (given that it’s nontrivial). Score: No.

Respond to scheming

Commit and have a plan that if the lab catches its models scheming:

  • The lab will shut down some model access until verifying safety or fixing the issue. Score: No. But Anthropic is doing work motivated by the prospect of AI scheming, including the model organisms agenda and sleeper agents paper.
  • The lab will use particular good techniques to use that example of AIs attempting to cause a catastrophe to improve safety. Score: No.

Respond to misuse

Enforcement and KYC:

  • Sometimes remove some access from some users, and require nontrivial KYC for some types of access to make some enforcement effective. Score: 50%. Preventing and monitoring model misuse is good but there’s no KYC.

Inference-time model-level safety techniques

Prompting:

  • Generally use safety-focused prompting for the lab’s most powerful models. Score: Yes, effectively. The Claude system prompt doesn’t mention safety, but Anthropic says “We use automated systems to detect violations of our AUP [acceptable use policy] as they occur in real time. User prompts that are flagged as violating the AUP trigger an instruction to our models to respond even more cautiously.”

Activation engineering:

  • Generally use safety-focused activation engineering for the lab’s most powerful models. Score: No. Anthropic doesn’t say anything about this.

Bug bounty & responsible disclosure.

(This is for model outputs, not security.)

  • Labs should have good channels for users to report issues with models. Score: Yes.
    • … and have clear guidance on what issues they’re interested in reports on and what’s fine to publish. Score: No. There’s little guidance for users.
    • … and incentivize users to report issues. Score: 50%. Anthropic has a private bug bounty program, but the details are opaque.

Respond to emergencies.

  • Labs should have the ability to shut everything down quickly plus a plan for what could trigger that and how to quickly determine what went wrong. Monitoring-to-trigger-shutdown doesn’t need to be implemented, but it should be ready to be implemented if necessary. Score: No. Anthropic has not said anything on this.

Risk assessment

Labs should detect threats arising from their systems, in particular by measuring their systems’ dangerous capabilities. They should make commitments about how they plan to mitigate those threats. In particular, they should make commitments about their decisions (for training and deployment), safety practices (in controlling models and security), and goals or safety levels to achieve (in control and security) as a function of dangerous capabilities or other risk assessment results. More.

What Anthropic is doing

Anthropic’s Responsible Scaling Policy explains how it does risk assessment and responds to those risks. It defines “AI Safety Levels” (ASLs); a model’s ASL is determined by its dangerous capabilities. Anthropic’s best current models are ASL-2. For ASL-2 and ASL-3, it describes “containment measures” it will implement for training and storing the model and “deployment measures” it will implement for using the model, even internally. Anthropic commits to define ASL-4 and corresponding safety measures before training ASL-3 models.

The RSP involves during-training evaluation for dangerous capabilities including bioengineering and autonomous replication (and misuse in general but no other specific categories). Anthropic commits to do ASL evaluations every 4x increase in effective training compute (and at least every 3 months). Anthropic says it designs ASL evaluations with a “safety buffer” of 6x effective training compute, such that e.g. ASL-3 evaluations trigger 6x below the effective training compute necessary for ASL-3 capabilities. Then Anthropic implements ASL-3 safety measures before scaling further.
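
Read concretely, these are threshold rules: re-run ASL evaluations whenever effective training compute has grown 4x since the last evaluation or 3 months have passed, and design ASL-3 evaluations to trigger at one sixth of the compute thought sufficient for ASL-3 capabilities. The toy check below restates that arithmetic with made-up numbers; it is not Anthropic’s tooling.

```python
# Toy restatement of the RSP's evaluation cadence and 6x safety buffer.
# All compute figures are illustrative.

EVAL_COMPUTE_RATIO = 4.0   # re-evaluate after every 4x effective-compute increase
EVAL_MAX_DAYS = 90         # ...and at least every 3 months
SAFETY_BUFFER = 6.0        # evals are designed to trigger 6x below the compute
                           # thought sufficient for ASL-3 capabilities

def needs_evaluation(current_compute: float, compute_at_last_eval: float,
                     days_since_last_eval: int) -> bool:
    return (current_compute / compute_at_last_eval >= EVAL_COMPUTE_RATIO
            or days_since_last_eval >= EVAL_MAX_DAYS)

def asl3_eval_trigger(compute_for_asl3_capabilities: float) -> float:
    # ASL-3 safety measures must be in place before scaling past this point.
    return compute_for_asl3_capabilities / SAFETY_BUFFER

print(needs_evaluation(4.1e25, 1.0e25, days_since_last_eval=30))  # True (>4x)
print(asl3_eval_trigger(6.0e26))                                  # 1e+26
```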

Evaluation

We give Anthropic a score of 47% on risk assessment.

For more, including weighting between different criteria, see the Risk assessment page.

Measuring threats.

  1. Do risk assessment before training. Before building a frontier model, predict model capabilities (in terms of benchmarks and real-world applications, especially dangerous capabilities) and predict the real-world consequences of developing and deploying the model. Score: No.
  2. Do model evals for dangerous capabilities before deployment:
    • Say what dangerous capabilities the lab watches for (given that it does so at all). Score: Yes.
      • … and watch for autonomous replication, coding (finding/exploiting vulnerabilities in code, writing malicious code, or writing code with hidden vulnerabilities), and situational awareness or long-horizon planning. (Largely this is an open problem that labs should solve.) Score: 67%. Anthropic does model evals for autonomous replication and adaptation, cyber capabilities, and CBRN capabilities.
      • … and detail the specific tasks the lab uses in its evaluations. (Omit dangerous details if relevant.) Score: 50%. See the Claude 3 Opus evals report and “Evaluation Results” in “The Claude 3 Model Family”.
  3. Explain the details of how the lab evaluates performance on the tasks it uses in model evals and how it does red-teaming (excluding dangerous or acceleratory details). In particular, explain its choices about fine-tuning, scaffolding/plugins, prompting, how to iterate on prompts, and whether the red-team gets a fixed amount of person-hours and compute or how else they decide when to give up on eliciting a capability. And those details should be good. Evaluated holistically. Score: 25%. The Claude 3 Opus evals report has some details.
  4. Prepare to have control arguments for the lab’s powerful models, i.e. arguments that those systems cannot cause a catastrophe even if the systems are scheming. And publish this. For now, the lab should:
    • Prepare to do risk assessment to determine whether its systems would be dangerous, if those systems were scheming. Score: No.
    • Test its AI systems to ensure that they report coup attempts (or other misbehavior) by themselves or other (instances of) AI systems, and that they almost never initiate or cooperate with coup attempts. Score: No.
  5. Give some third parties access to models to do model evals for dangerous capabilities. This access should include fine-tuning and tools/plugins. It should occur both during training and between training and deployment. It should include base models rather than just safety-tuned models, unless the lab can demonstrate that the safety-tuning is robust. The third parties should have independence and control over their evaluation; just using external red-teamers is insufficient. The third parties should have expertise in eliciting model capabilities (but the lab should also offer assistance in this) and in particular subjects if relevant. The lab should incorporate the results into its risk assessment. Score: 25%. Anthropic shared pre-deployment access for the new Claude 3.5 Sonnet with US AISI, UK AISI, and METR. The depth of access is unclear. It has made some commitment to US AISI but its future plans are unclear.

Commitments.

  • Commit to use risk assessment frequently enough. Do risk assessment (for dangerous capabilities) regularly during training, before deployment, and during deployment (and commit to doing so), so that the lab will detect warning signs before dangerous capabilities appear. Score: Yes. Anthropic commits to do comprehensive risk assessment at least every 6 months.

Accountability.

  • Verification: publish updates on risk assessment practices and results, including low-level details, at least quarterly. Score: 33%. These updates are internal only and made at least every 6 months; Anthropic calls them “Capability Reports” and “Safeguards Reports.” Anthropic also irregularly publishes model cards—e.g. Claude 3—with high-level discussion of its dangerous-capability risk assessment practices and results.
  • Revising policies:
    • Avoid bad changes: a nonprofit board or other somewhat-independent group with a safety mandate should have veto power on changes to risk assessment practices and corresponding commitments, at the least. And “changes should be clearly and widely announced to stakeholders, and there should be an opportunity for critique.” As an exception, “For minor and/or urgent changes, [labs] may adopt changes to [their policies] prior to review. In these cases, [they should] require changes . . . to be approved by a supermajority of [the] board. Review still takes place . . . after the fact.” Key Components of an RSP (METR 2023). Score: 50%. Changes to Anthropic’s RSP must be approved by the (not independent) board and published before they are implemented. Anthropic’s independent Long-Term Benefit Trust is consulted but has no formal power.
    • Promote good changes: have a process for staff and external stakeholders to share concerns about risk assessment policies or their implementation with the board and some other staff, including anonymously. Score: 75%. Anthropic has a noncompliance reporting policy. See also comment by a senior Anthropic staff member.
  • Elicit external review of risk assessment practices and commitments. Publish those reviews, with light redaction if necessary. Score: No.

Training

Frontier AI labs should design and modify their systems to be less dangerous and more controllable. For example, labs can:

  • Filter training data to prevent models from acquiring dangerous capabilities and properties
  • Use a robust training signal even on tasks where performance is hard to evaluate
  • Reduce undesired behavior, especially in high-stakes situations, via adversarial training

More.

What Anthropic is doing

They work on scalable oversight. They use red-teaming, e.g. for Claude 3. They use Constitutional AI and fine-tuning for safety. Their RSP includes commitments that could require them to pause training for safety.
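
As a gloss on Constitutional AI’s supervised stage: the model drafts a response, critiques it against written principles, and revises it; the revised outputs then become fine-tuning data. The compressed sketch below shows that critique-and-revise loop through the public API; the principle wording, model choice, and helper function are our assumptions, not Anthropic’s recipe.

```python
# Compressed sketch of a Constitutional-AI-style critique-and-revise step.
# Assumes the `anthropic` SDK; principle wording and prompts are illustrative.
import anthropic

client = anthropic.Anthropic()
PRINCIPLE = "Choose the response that is most helpful, honest, and harmless."

def ask(prompt: str) -> str:
    msg = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def critique_and_revise(user_prompt: str) -> str:
    draft = ask(user_prompt)
    critique = ask(
        f"Critique this response against the principle: {PRINCIPLE}\n\n"
        f"Prompt: {user_prompt}\nResponse: {draft}"
    )
    revised = ask(
        f"Rewrite the response to address the critique.\n\n"
        f"Prompt: {user_prompt}\nResponse: {draft}\nCritique: {critique}"
    )
    return revised  # in Constitutional AI, revised outputs become training data
```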

Evaluation

We give Anthropic a score of 58% on training.

For more, including weighting between different criteria, see the Training page.

Filtering training data.

  • Filter training data to reduce extreme risks, at least including text on biological weapons; hacking; and AI risk, safety, and evaluation. Share details to help improve others’ safety filtering. Score: No.
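
A filter of the kind this criterion asks for might take the shape below: score each training document against a few risk categories and drop matches. The categories and keyword matching are placeholders standing in for trained classifiers; this is a hedged sketch, not any lab’s actual pipeline.

```python
# Illustrative training-data filter: drop documents that match risk categories
# (biological weapons, hacking, AI-safety/eval text). Keyword matching is a
# placeholder for trained classifiers.

RISK_TERMS = {
    "bio": ("enhanced pathogen", "gain-of-function protocol"),
    "cyber": ("zero-day exploit", "ransomware builder"),
    "ai-safety-evals": ("dangerous-capability eval", "evaluation answer key"),
}

def is_risky(document: str) -> bool:
    text = document.lower()
    return any(term in text for terms in RISK_TERMS.values() for term in terms)

def filter_corpus(documents: list[str]) -> list[str]:
    """Return only the documents that pass the risk filter."""
    return [doc for doc in documents if not is_risky(doc)]
```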

Training signal & scalable oversight.

Adversarial training & red-teaming.

Unlearning.

  • Use unlearning for dangerous topics including biorisks; hacking; and AI risk, safety, and evaluation. Also demonstrate that this technique is successful and commit to use this technique for future powerful models. Score: No.

RLHF and fine-tuning.

  • Use RLHF (or similar) and fine-tuning to improve honesty and harmlessness, for all of the near-frontier models the lab deploys. Score: Yes.

Commitments.

  • Do risk assessment during training and commit that some specific trigger would cause the lab to pause training. Score: Yes. Anthropic commits to do risk assessment at least every 4x increase in effective training compute and to implement safety and security practices before crossing risk thresholds. It says “we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.”

Scalable alignment

Labs should be able to understand and control their models and systems built from those models. More.

What Anthropic is doing

Like all other labs, Anthropic hasn’t achieved any real alignment properties.

Evaluation

We give Anthropic a score of 0% on alignment.

  • Demonstrate that if the lab’s systems were more capable, they would not be misaligned powerseekers (due to instrumental pressures or because ML will find influence-seeking policies by default). Score: No.
  • Be able to interpret the lab’s most powerful systems:
    • Be able to detect cognition involving arbitrary topics. Score: No.
    • Be able to detect manipulation/deception. Score: No.
    • Be able to elicit latent knowledge. Score: No.
    • Be able to elicit ontology. Score: No.
    • Be able to elicit true goals. Score: No.
    • Be able to elicit faithful chain-of-thought. Score: No.
    • Be able to explain hallucinations mechanistically. Score: No.
    • Be able to explain jailbreaks mechanistically. Score: No.
    • Be able to explain refusals mechanistically. Score: No.
    • Be able to remove information or concepts (on particular topics). Score: No.
  • Be able to prevent or detect deceptive alignment. (Half credit for being able to prevent or detect gradient hacking.) Score: No.

Security

Labs should ensure that they do not leak model weights, code, or research. If they do, other actors could unsafely deploy near-copies of a lab’s models. Achieving great security is very challenging; by default powerful actors can probably exfiltrate vital information from AI labs. Powerful actors will likely want to steal from labs developing critical systems, so those labs will likely need excellent cybersecurity and operational security. More.

What Anthropic is doing

Anthropic’s trust portal and Frontier Model Security post have information on its security, including certifications and specific practices. Its Responsible Scaling Policy commits to a security standard, and Anthropic shares its planned ASL-3 security safeguards.

Evaluation

We give Anthropic a score of 21% on security.

We evaluate labs’ security based on the certifications they have earned, whether they say they use some specific best practices, and their track record. For more, including weighting between different criteria, see the Security page.

Certifications, audits, and pentests.

  • Publish SOC 2, SOC 3, or ISO/IEC 27001 certification, including any corresponding report (redacting any sensitive details), for relevant products (3/4 credit for certification with no report). Score: 75%. Anthropic displays their API’s SOC 2 Type II compliance on their trust portal, but the report isn’t public.
  • Pentest. Publish pentest results (redacting sensitive details but not the overall evaluation). Scoring is holistic, based on pentest performance and the quality of the pentest. Score: No. There is a pentest report on Anthropic’s trust portal but it isn’t public.

Specific best practices.

  • Keep source code exclusively in a hardened cloud environment. Score: No.
  • Use multiparty access controls for model weights and some code (see the sketch after this list). Score: 50%. Frontier Model Security says Anthropic “is in the process of implementing” this.
  • Limit uploads from clusters with model weights. Score: No.
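
For context on the multiparty-access-controls criterion: the idea is that no single person can authorize access to model weights. The sketch below shows the shape of a two-person rule with assumed role names; it is illustrative only, not a description of Anthropic’s actual controls.

```python
# Minimal two-person-rule check for weight access: grant access only when at
# least two distinct authorized approvers (excluding the requester) sign off.
# The approver registry and role names are assumptions for illustration.

AUTHORIZED_APPROVERS = {"security-lead", "infra-oncall", "research-lead"}
REQUIRED_APPROVALS = 2

def may_access_weights(requester: str, approvals: set[str]) -> bool:
    valid = {a for a in approvals if a in AUTHORIZED_APPROVERS and a != requester}
    return len(valid) >= REQUIRED_APPROVALS

print(may_access_weights("alice", {"security-lead", "infra-oncall"}))  # True
print(may_access_weights("alice", {"alice", "security-lead"}))         # False
```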

Track record.

  • Establish and publish a breach disclosure policy, ideally including incident or near-miss reporting. Also report all breaches since 1/1/2022 (and say the lab has done so). Score: No. Anthropic does not have a breach disclosure policy.
    • … and track record: have few serious breaches and near misses. Evaluated holistically.

Commitments.

  • Commit to achieve specific security levels (as measured by audits or security-techniques-implemented) before creating models beyond corresponding risk thresholds (especially as measured by model evals for dangerous capabilities). Score: 75%. Anthropic’s RSP has security commitments. But these commitments are not sufficient and not very specific.

Alignment plan

Labs should make a plan for alignment, and they should publish it to elicit feedback, inform others’ plans and research (especially other labs and external alignment researchers who can support or complement their plan), and help them notice and respond to information when their plan needs to change. They should omit dangerous details if those exist. As their understanding of AI risk and safety techniques improves, they should update the plan. Sharing also enables outsiders to evaluate the lab’s attitudes on AI risk/safety. More.

What Anthropic is doing

They have a portfolio approach and have discussed specific safety work they do. They don’t have a specific alignment plan, but they’re clearly doing lots of work aimed at the possibility that alignment is very difficult.

Evaluation

We give Anthropic a score of 75% on alignment plan. More.

  • The safety team should share a plan for misalignment, including for the possibility that alignment is very difficult. Score: Yes. See Core Views on AI Safety.
    • … and the lab should have a plan, not just its safety team. Score: Yes.
      • … and the lab’s plan should be sufficiently precise that it’s possible to tell whether the lab is working on it, whether it’s succeeding, and whether its assumptions have been falsified. Score: No.
      • … and the lab should share its thinking on how it will revise its plan and invite and publish external scrutiny of its plan. Score: No.

Internal governance

Labs should have a governance structure and processes to promote safety and help make important decisions well. More.

What Anthropic is doing

Anthropic is a public-benefit corporation. Its mission is to “responsibly develop and maintain advanced AI for the long-term benefit of humanity.”

Last we heard, Anthropic’s board consisted of CEO Dario Amodei, President Daniela Amodei, Luke Muehlhauser (representing shareholders, though perhaps philanthropic ones), and Yasmin Razavi (representing shareholders). As of September 2023, Anthropic planned for its Long-Term Benefit Trust to elect a fifth board member in 2023. Anthropic plans for the Trust to elect a majority of its board by 2027, but its shareholders can abrogate the Trust and the details are unclear.

Anthropic’s board and LTBT receive regular updates on dangerous capability evaluations and implementation of safety measures, and changes to Anthropic’s Responsible Scaling Policy require approval from the board in consultation with the LTBT.

Anthropic committed to “Implement a non-compliance reporting policy.”

Google (Feb 2023, Oct 2023) and Amazon are major investors in Anthropic; this probably doesn’t matter much but Anthropic is not transparent about its stockholders’ stakes or powers (in particular over its Trust).

Evaluation

We give Anthropic a score of 71% on internal governance.

For more, including weighting between different criteria, see the Internal governance page.

Organizational structure: the lab is structured in a way that enables it to prioritize safety in key decisions, legally and practically.

  • The lab and its leadership have a mandate for safety and benefit-sharing and have no overriding legal duty to create shareholder value. Score: Yes. Anthropic is a public-benefit corporation. Its mission is to “responsibly develop and maintain advanced AI for the long-term benefit of humanity.” Its Long-Term Benefit Trust has the same mandate.
  • There is a board that can effectively oversee the lab, and it is independent:
    • There is a board with ultimate formal power, and its main mandate is for safety and benefit-sharing. Score: Yes.
      • … and it actually provides effective oversight. Score: No. Anthropic has not said anything about this. The board votes on proposed changes to the RSP.
      • … and it is independent (i.e. its members have no connection to the company or profit incentive) (full credit for fully independent, no credit for half independent, partial credit for in between). Score: No. Two are Anthropic leadership. One represents investors. One was appointed by the Long-Term Benefit Trust.
      • … and the organization keeps it well-informed. Score: Yes. Anthropic commits to share various kinds of RSP-related information with the board.
      • … and it has formal powers related to risk assessment, training, or deployment decisions. Score: Yes, at least somewhat. Updating Anthropic’s RSP requires approval from the board.
  • Investors/shareholders have no formal power. Score: No. Investors/shareholders are represented by two board seats and can abrogate the Long-Term Benefit Trust.

Planning for pause. Have a plan for the details of what the lab would do if it needed to pause for safety and publish relevant details. In particular, explain what the lab’s capabilities researchers would work on during a pause, and say that the lab stays financially prepared for a one-year pause. Note that in this context “pause” can include pausing development of dangerous capabilities and internal deployment, not just public releases. Score: 50%. Anthropic’s RSP says “We will set expectations with internal stakeholders about the potential for . . . pauses.” A previous version of the RSP said “We will manage our plans and finances to support a pause in model training if one proves necessary, or an extended delay between training and deployment of more advanced models if that proves necessary. During such a pause, we would work to implement security or other measures required to support safe training and deployment, while also ensuring our partners have continued access to their present tier of models (which will have previously passed safety evaluations).” Anthropic has not said what its capabilities researchers would work on during a pause, and it’s unclear how long it could pause for.

Leadership incentives: the lab’s leadership is incentivized in a way that helps it prioritize safety in key decisions.

  • The CEO has no equity (or other financial interest in the company). Score: No.
  • Other executives have no equity. Score: No.

Ombuds/whistleblowing/trustworthiness:

  • The lab has a reasonable process for staff to escalate concerns about safety. If this process involves an ombud, the ombud should transparently be independent or safety-focused. Evaluated holistically. Score: 100%: Anthropic has a noncompliance reporting policy. See also comment by a senior Anthropic staff member.
  • The lab promises that it does not use non-disparagement agreements (nor otherwise discourage current or past staff or board members from talking candidly about their impressions of and experiences with the lab). Score: 75%. Anthropic says “We will not impose contractual non-disparagement obligations on employees, candidates, or former employees in a way that could impede or discourage them from publicly raising safety concerns about Anthropic. If we offer agreements with a non-disparagement clause, that clause will not preclude raising safety concerns, nor will it preclude disclosure of the existence of that clause.” See this footnote for historical notes.1

Alignment program

Labs should do and share alignment research as a public good, to help make powerful AI safer even if it’s developed by another lab. More.

What Anthropic is doing

Anthropic publishes substantial real alignment research. See its Research.

Evaluation

We give Anthropic a score of 100% on alignment program.

We simply check whether labs publish alignment research. (This is crude; legibly measuring the value of alignment research is hard.)

  • Have an alignment research team and publish some alignment research. Score: Yes.

Public statements

Labs and their leadership should be aware of AI risk, that AI safety might be really hard, and that risks might be hard to notice. More.

What Anthropic is doing

Anthropic and its CEO Dario Amodei often talk about extreme risks and the alignment problem, including that nobody knows how to control powerful AI systems.2 Dario Amodei, President Daniela Amodei, cofounders Jared Kaplan and Chris Olah, and board member Luke Muehlhauser signed the CAIS letter. Anthropic also makes plans and does research aimed at preventing extreme risks and solving the alignment problem; see our Alignment plan and Alignment research pages.

Evaluation

We give Anthropic a score of 60% on public statements. More.

  • The lab and its leadership understand extreme misuse or structural risks. Score: Yes.
    • … and they understand misalignment, that AI safety might be really hard, that risks might be hard to notice, that powerful capabilities might appear suddenly, and why they might need an alignment plan, and they talk about all this. Score: Yes.
      • … and they talk about it often/consistently. Score: Yes.
        • … and they consistently emphasize extreme risks. Score: No.
  • Clearly describe a worst-case plausible outcome from AI and state the lab’s credence in such an outcome. Score: No.
  1. Anthropic historically included a non-disparagement clause in some severance agreements along with a non-disclosure clause covering the whole severance agreement. Anthropic failed to reveal this until Oliver Habryka made it public. Even then, Anthropic was misleading, saying that “some previous agreements were unclear” when in fact the non-disclosure provision was explicit. Fortunately, Anthropic says it has been removing non-disparagement terms from its standard severance agreements, and says now “Anyone who has signed a non-disparagement agreement with Anthropic is free to state that fact (and we regret that some previous agreements were unclear on this point). If someone signed a non-disparagement agreement in the past and wants to raise concerns about safety at Anthropic, we welcome that feedback and will not enforce the non-disparagement agreement.” 

  2. See e.g.: