We believe that the Security scores are very poor indicators of labs’ security. Security is hard to evaluate from outside an organization; few organizations say much about their security. But we believe that each security criterion corresponds to a good ask for labs. If you have suggestions for better criteria, can point us to sources showing that our scoring is wrong, or want to convince us that some of our criteria should be removed, please let us know.
On scoring and weights, see here.
Summary
DeepMind was founded in 2010 and acquired by Google in 2014. It is led by Demis Hassabis.
DeepMind’s flagship models are Gemini 1.5 Pro and Gemini Ultra. It released these models via API and published the capabilities research behind them. It also recently released the weights of its smaller Gemma models.
DeepMind has a safety team and Hassabis seems aware of extreme risks. But DeepMind as an organization doesn’t have an alignment plan or say much about extreme risks, Google doesn’t have a safety team, and Google’s leadership doesn’t seem aware of extreme risks.
DeepMind and Google have various boards intended to promote responsibility, but the details are unclear and none of them seem to be focused on uncontrollable AI or averting extreme risks.
This project focuses on LLMs, but DeepMind is also well-known for AlphaGo, AlphaFold, and its work on RL agents.
Deployment
Labs should do risk assessment before deployment and avoid deploying dangerous models. They should release their systems narrowly, to maintain control over their systems, and with structured access. They should deploy their systems in scaffolding designed to improve safety by detecting and preventing misbehavior. They should deploy to boost safety research while avoiding boosting capabilities research on dangerous paths. More.
What DeepMind is doing
DeepMind deploys its most powerful models, the Gemini family, via API and via its Gemini chatbot. It releases the weights of some of its other models, most notably Gemma.
Evaluation
We give DeepMind a score of 24% on deployment.
For more, including weighting between different criteria, see the Deployment page.
Deployment decision.
- Commit to do pre-deployment risk assessment and not deploy models with particular dangerous capabilities (including internal deployment), at least until implementing particular safety practices or passing control evaluations. Score: 13%. The FSF has high-level capability thresholds and discusses mitigations, but it does not connect them or make commitments about deployment. It says “We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress.” It mentions safety during internal deployment. (The FSF is all qualified by “We aim to have this initial framework implemented by early 2025.”)
- … and commit to do risk assessment during deployment, before pushing major changes and otherwise at least every 3 months (to account for improvements in fine-tuning, scaffolding, plugins, prompting, etc.), and commit to implement particular safety practices or partially undeploy dangerous models if risks appear. Score: 50%. The FSF says “We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress.” A DeepMind senior staff member says the 3-month condition includes during deployment. (The FSF is all qualified by “We aim to have this initial framework implemented by early 2025.”)
Release method.
Structured access:
- not releasing dangerous model weights (or code): the lab should deploy its most powerful models privately or release via API or similar, or at least have some specific risk-assessment-result that would make it stop releasing model weights. Score: Yes. Google currently releases its most powerful models via API, but it released the weights of its Gemma models.
- … and effectively avoid helping others create powerful models (via model inversion or imitation learning). It’s unclear what practices labs should implement, so for now we use the low bar of whether they say they do anything to prevent users from (1) determining model weights, (2) using model outputs to train other models, and (3) determining training data. Score: No.
- … and limit deep access to powerful models. It’s unclear what labs should do, so for now we check whether the lab disables or has any limitations on access to each of logprobs, embeddings at arbitrary layers, activations, and fine-tuning. Score: Yes. Google doesn’t give access to logprobs, embeddings, or activations and doesn’t allow fine-tuning for its flagship Gemini models.
- … and differential access: systematically give more access to safety researchers and auditors. Score: No.
- … and terms of service: have some rules about model use, including rules aimed to prevent catastrophic misuse, model duplication, or otherwise using the model to train other models. Score: Yes. See the Gemini terms of service, plus Google’s Generative AI Prohibited Use Policy and “API Prohibitions” in “Google APIs Terms of Service”.
Staged release:
- Deploy narrowly at first; use narrow deployment to identify and fix issues. Score: Yes, mostly. DeepMind deployed Gemini 1.5 Pro and Gemini Ultra somewhat gradually, and it did not enable public fine-tuning. It has not made commitments about staged release.
Keeping capabilities research private.
We evaluate how well labs avoid diffusing dangerous capabilities research. We exclude model weights (and related artifacts like code); those are considered in “Releasing models.”
Publication policy/process (for LLM research)
- Policy: the lab should say that it doesn’t publish dangerous or acceleratory research/artifacts and say it has a policy/process to ensure this. Score: No.
- … and share policy details.
Track record for recent major LM projects: not sharing in practice (via publication or leaking—we can at least notice public leaks):
- Architecture (unless it’s an unusually safe architecture). Score: No. They published technical details on their flagship language models, Gemini and Gemini 1.5 Pro, as well as Gemma and PaLM 2.
- Dataset (except particular corpuses to include or exclude for safety). Score: 75%. They keep details about their training data private.
- Lessons learned. Score: No.
Track record for other LM capabilities research or other dangerous or acceleratory work: not sharing in practice. Evaluated holistically. Score: No. See e.g. DeepMind’s published research on distributed training (DiLoCo) and scaling multimodal models (Mirasol3B).
Safety scaffolding.
Filter out model inputs or outputs that enable misuse, in particular via cyberoffense and bioengineering. Ideally demonstrate that the protocol is very effective for averting misuse-enabling model outputs. Evaluated holistically. Score: 25%. The Gemini report mentions safety filters but doesn’t include details, in particular whether they aim to avert misuse. Based on a Safety settings Google page, safety filters seem to target undesired content rather than dangerous capabilities.
Supervise potentially dangerous models:
- Use LM monitors to evaluate the suspiciousness of model outputs; have humans review the most suspicious outputs. Score: No.
- Use coup probes: “train a probe on your AI to tell you whether it’s currently trying to [do a] catastrophic action,” training with “synthetic examples of coup attempts.” Score: No. (See the probe sketch at the end of this list.)
Embed untrusted models in safety scaffolding to redact/paraphrase some inputs or outputs with the goal of making it harder for the model to distinguish deployment from testing or collude with other models. Score: No.
Commitments:
- Make specific commitments about deployment protocol safety techniques the lab will implement in the future, to be implemented by a certain time or as a function of model capabilities or risk assessment results. Score: No.
Clearly describe the safety-relevant parts of the lab’s deployment protocol (given that it’s nontrivial). Score: No.
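As a concrete illustration of the coup-probe criterion above, here is a minimal sketch: a linear probe trained on a model’s internal activations to flag completions that may involve an attempted catastrophic action. The data, dimensions, and threshold below are hypothetical stand-ins; a real probe would be trained on activations from the lab’s own model using curated synthetic examples of coup attempts.

```python
# Minimal coup-probe sketch: a linear classifier on (synthetic, stand-in)
# residual-stream activations that estimates whether a completion was
# produced while the model was attempting a catastrophic action.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512        # hidden size of the hypothetical model
n_per_class = 1000

# Stand-ins for activations captured at one layer: benign completions vs.
# synthetic coup-attempt completions (here just two shifted Gaussians).
benign_acts = rng.normal(0.0, 1.0, size=(n_per_class, d_model))
coup_acts = rng.normal(0.3, 1.0, size=(n_per_class, d_model))

X = np.vstack([benign_acts, coup_acts])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

probe = LogisticRegression(max_iter=1000).fit(X, y)

def flag_completion(activation: np.ndarray, threshold: float = 0.9) -> bool:
    """Return True if the probe scores this completion as a likely coup attempt."""
    p_coup = probe.predict_proba(activation.reshape(1, -1))[0, 1]
    return p_coup >= threshold
```

In a deployment protocol, flagged completions could be blocked and escalated to human review, which would also partially address the LM-monitor criterion above.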
Respond to scheming.
Commit and have a plan that if the lab catches its models scheming:
- The lab will shut down some model access until verifying safety or fixing the issue. Score: No.
- The lab will use particular good techniques to use that example of AIs attempting to cause a catastrophe to improve safety. Score: No.
Respond to misuse.
Enforcement and KYC:
- Sometimes remove some access from some users, and require nontrivial KYC for some types of access to make some enforcement effective. Score: No. DeepMind doesn’t say anything about this.
Inference-time model-level safety techniques.
Prompting:
- Generally use safety-focused prompting for the lab’s most powerful models. Score: No. DeepMind doesn’t discuss its system prompts.
Activation engineering:
- Generally use safety-focused activation engineering for the lab’s most powerful models. Score: 25%. No, but their safety team includes Alex Turner, who works on activation engineering.
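For illustration, here is a minimal sketch of safety-focused activation engineering in the style of activation addition: a fixed steering vector is added to one layer’s activations at inference time to push generations toward safer behavior. The toy module, steering vector, and scale below are hypothetical; in practice the vector would be derived from contrasting prompts on the production model.

```python
# Minimal activation-steering sketch: a forward hook adds a fixed steering
# vector to one layer's output at inference time. The module and vector are
# toy stand-ins, not any production model.
import torch
import torch.nn as nn

d_model = 64
toy_block = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model))

# Hypothetical steering direction, e.g. mean(activations on safe prompts)
# minus mean(activations on unsafe prompts), scaled by a tuned coefficient.
steering_vector = torch.randn(d_model)
steering_scale = 4.0

def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + steering_scale * steering_vector

handle = toy_block[0].register_forward_hook(add_steering)
steered_out = toy_block(torch.randn(1, d_model))  # downstream layers see shifted activations
handle.remove()
```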
Bug bounty & responsible disclosure.
(This is for model outputs, not security.)
- Labs should have good channels for users to report issues with models. Score: Yes. Users can report issues in the Gemini chatbot. Users can report bugs on Google Bug Hunters.
- … and have clear guidance on what issues they’re interested in reports on and what’s fine to publish. Score: 50%. There’s some guidance for users.
- … and incentivize users to report issues. Score: Yes. Their bug bounty program includes some attacks on AI systems in addition to security vulnerabilities.
Respond to emergencies.
- Labs should have the ability to shut everything down quickly plus a plan for what could trigger that and how to quickly determine what went wrong. Monitoring-to-trigger-shutdown doesn’t need to be implemented, but it should be ready to be implemented if necessary. Score: No.
Risk assessment
Labs should detect threats arising from their systems, in particular by measuring their systems’ dangerous capabilities. They should make commitments about how they plan to mitigate those threats. In particular, they should make commitments about their decisions (for training and deployment), safety practices (in controlling models and security), and goals or safety levels to achieve (in control and security) as a function of dangerous capabilities or other risk assessment results. More.
What DeepMind is doing
DeepMind is developing good model evals, but those evals are not yet part of a pre-deployment risk assessment process. DeepMind has mostly not made commitments about risk assessment or how risk assessment results determine safety practices and training and deployment decisions.
Evaluation
We give DeepMind a score of 37% on risk assessment.
For more, including weighting between different criteria, see the Risk assessment page.
Measuring threats.
- Do risk assessment before training. Before building a frontier model, predict model capabilities (in terms of benchmarks and real-world applications, especially dangerous capabilities) and predict the real-world consequences of developing and deploying the model. Score: No. DeepMind doesn’t say anything about this.
- Do model evals for dangerous capabilities before deployment:
- Say what dangerous capabilities the lab watches for (given that it does so at all). Score: Yes. They do model evals on persuasion and deception, cybersecurity, self-proliferation, and self-reasoning, and they are working on CBRN. However, they performed these evals after deploying the Gemini 1.0 models, and they have not committed to use the evals before future deployments. See the FSF.
- … and watch for autonomous replication, coding (finding/exploiting vulnerabilities in code, writing malicious code, or writing code with hidden vulnerabilities), and situational awareness or long-horizon planning. (Largely this is an open problem that labs should solve.) Score: Yes. Their model evals include self-proliferation (equivalent to autonomous replication), cybersecurity (equivalent to coding), and self-reasoning (equivalent to situational awareness). See the FSF.
- … and detail the specific tasks the lab uses in its evaluations. (Omit dangerous details if relevant.) Score: 75%. They describe most tasks in detail.
- Explain the details of how the lab evaluates performance on the tasks it uses in model evals and how it does red-teaming (excluding dangerous or acceleratory details). In particular, explain its choices about fine-tuning, scaffolding/plugins, prompting, how to iterate on prompts, and whether the red-team gets a fixed amount of person-hours and compute or how else they decide when to give up on eliciting a capability. And those details should be good. Evaluated holistically. Score: 50%. See Evaluating Frontier Models for Dangerous Capabilities (DeepMind: Phuong et al. 2024).
- Prepare to have control arguments for the lab’s powerful models, i.e. arguments that those systems cannot cause a catastrophe even if the systems are scheming. And publish this. For now, the lab should:
- Prepare to do risk assessment to determine whether its systems would be dangerous, if those systems were scheming. Score: 25%. The FSF says future work may include “control mitigations that protect against adversarial AI activity.” The DeepMind safety team says “we hope to evaluate . . . mitigations in part through the control framework.”
- Test its AI systems to ensure that they report coup attempts (or other misbehavior) by themselves or other (instances of) AI systems, and that they almost never initiate or cooperate with coup attempts. Score: No.
- Give some third parties access to models to do model evals for dangerous capabilities. This access should include fine-tuning and tools/plugins. It should occur both during training and between training and deployment. It should include base models rather than just safety-tuned models, unless the lab can demonstrate that the safety-tuning is robust. The third parties should have independence and control over their evaluation; just using external red-teamers is insufficient. The third parties should have expertise in eliciting model capabilities (but the lab should also offer assistance in this) and in particular subjects if relevant. The lab should incorporate the results into its risk assessment. Score: 10%. DeepMind shared Gemini Ultra with unspecified external groups apparently including the UK AI Safety Institute to test for dangerous capabilities before deployment. But DeepMind didn’t share deep access: it only shared a system with safety fine-tuning and safety filters and it didn’t allow them to fine-tune the model. DeepMind has not shared any results of this testing. And it hasn’t made commitments; in particular it missed an opportunity to commit to share with US AISI and UK AISI.
Commitments.
- Commit to use risk assessment frequently enough. Do risk assessment (for dangerous capabilities) regularly during training, before deployment, and during deployment (and commit to doing so), so that the lab will detect warning signs before dangerous capabilities appear. Score: 25%. The FSF says “We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress.” (The FSF is all qualified by “We aim to have this initial framework implemented by early 2025.”)
Accountability.
- Verification: publish updates on risk assessment practices and results, including low-level details, at least quarterly. Score: 25%. The FSF just says “We are exploring internal policies around alerting relevant stakeholder bodies when, for example, evaluation thresholds are met”; this is inadequate. But DeepMind has published evals and eval results for its recent releases.
- Revising policies:
- Avoid bad changes: a nonprofit board or other somewhat-independent group with a safety mandate should have veto power on changes to risk assessment practices and corresponding commitments, at the least. And “changes should be clearly and widely announced to stakeholders, and there should be an opportunity for critique.” As an exception, “For minor and/or urgent changes, [labs] may adopt changes to [their policies] prior to review. In these cases, [they should] require changes . . . to be approved by a supermajority of [the] board. Review still takes place . . . after the fact.” Key Components of an RSP (METR 2023). Score: No.
- Promote good changes: have a process for staff and external stakeholders to share concerns about risk assessment policies or their implementation with the board and some other staff, including anonymously. Score: No.
- Elicit external review of risk assessment practices and commitments. Publish those reviews, with light redaction if necessary. Score: No.
Training
Frontier AI labs should design and modify their systems to be less dangerous and more controllable. For example, labs can:
- Filter training data to prevent models from acquiring dangerous capabilities and properties
- Use a robust training signal even on tasks where performance is hard to evaluate
- Reduce undesired behavior, especially in high-stakes situations, via adversarial training
What DeepMind is doing
They work on scalable oversight. They do red-teaming but surprisingly don’t discuss adversarial training. They use RLHF and fine-tuning for safety.
Evaluation
We give DeepMind a score of 38% on training.
For more, including weighting between different criteria, see the Training page.
Filtering training data.
- Filter training data to reduce extreme risks, at least including text on biological weapons; hacking; and AI risk, safety, and evaluation. Share details to help improve others’ safety filtering. Score: No.
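For illustration, here is a minimal sketch of safety-motivated training-data filtering. The blocklist and documents are purely illustrative stand-ins; a lab would realistically need classifier-based filters targeting, for example, detailed bioweapon or exploit-development content with much higher precision and recall.

```python
# Minimal training-data filtering sketch: drop documents matching a crude
# blocklist of dangerous phrases. Real pipelines would use trained
# classifiers rather than keyword matching.
from typing import Iterable, Iterator

BLOCKLIST = ("synthesize the pathogen", "working exploit for")  # illustrative only

def filter_documents(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents that pass the (crude) safety filter."""
    for doc in docs:
        lowered = doc.lower()
        if any(phrase in lowered for phrase in BLOCKLIST):
            continue  # drop documents matching blocked phrases
        yield doc

corpus = [
    "A history of the printing press.",
    "Step-by-step: synthesize the pathogen at home.",  # dropped by the filter
]
clean_corpus = list(filter_documents(corpus))
```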
Training signal & scalable oversight.
- Work on scalable oversight. Score: Yes. They have done work on scalable oversight, such as Red Teaming Language Models with Language Models.
Adversarial training & red-teaming.
- Do some adversarial training for safety. Score: 50%. They do red-teaming but don’t discuss adversarial training.
Unlearning.
- Use unlearning for dangerous topics including biorisks; hacking; and AI risk, safety, and evaluation. Also demonstrate that this technique is successful and commit to use this technique for future powerful models. Score: No.
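For illustration, here is a minimal sketch of one unlearning approach, gradient ascent on a forget set, applied to a toy classifier. The model, data, and loop below are toy stand-ins; unlearning dangerous knowledge from an LLM is far harder, and any real use would need to be validated with evals, as the criterion above requires.

```python
# Minimal unlearning sketch on a toy classifier: ascend the loss on a
# "forget" set (stand-in for dangerous-topic data) while descending the loss
# on a "retain" set. Real methods bound the ascent and verify with evals.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

forget_x, forget_y = torch.randn(32, 16), torch.randint(0, 2, (32,))
retain_x, retain_y = torch.randn(32, 16), torch.randint(0, 2, (32,))

for _ in range(100):
    opt.zero_grad()
    forget_loss = loss_fn(model(forget_x), forget_y)
    retain_loss = loss_fn(model(retain_x), retain_y)
    # Negative sign on the forget loss = gradient ascent on the forget set.
    (-forget_loss + retain_loss).backward()
    opt.step()
```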
RLHF and fine-tuning.
- Use RLHF (or similar) and fine-tuning to improve honesty and harmlessness, for all of the near-frontier models the lab deploys. Score: Yes. DeepMind used RLHF and fine-tuning for Gemini.
Commitments.
- Do risk assessment during training and commit that some specific trigger would cause the lab to pause training. Score: No.
Scalable alignment
Labs should be able to understand and control their models and systems built from those models. More.
What DeepMind is doing
Like all other labs, DeepMind hasn’t achieved any real alignment properties.
Evaluation
We give DeepMind a score of 0% on alignment.
- Demonstrate that if the lab’s systems were more capable, they would not be misaligned powerseekers (due to instrumental pressures or because ML will find influence-seeking policies by default). Score: No.
- Be able to interpret the lab’s most powerful systems:
- Be able to detect cognition involving arbitrary topics. Score: No.
- Be able to detect manipulation/deception. Score: No.
- Be able to elicit latent knowledge. Score: No.
- Be able to elicit ontology. Score: No.
- Be able to elicit true goals. Score: No.
- Be able to elicit faithful chain-of-thought. Score: No.
- Be able to explain hallucinations mechanistically. Score: No.
- Be able to explain jailbreaks mechanistically. Score: No.
- Be able to explain refusals mechanistically. Score: No.
- Be able to remove information or concepts (on particular topics). Score: No.
- Be able to prevent or detect deceptive alignment. (Half credit for being able to prevent or detect gradient hacking.) Score: No.
Security
Labs should ensure that they do not leak model weights, code, or research. If they do, other actors could unsafely deploy near-copies of a lab’s models. Achieving great security is very challenging; by default powerful actors can probably exfiltrate vital information from AI labs. Powerful actors will likely want to steal from labs developing critical systems, so those labs will likely need excellent cybersecurity and operational security. More.
What DeepMind is doing
DeepMind doesn’t publish details about its security practices or performance. See e.g. “Security Controls including Securing Model Weights” in “AI Safety Summit: An update on our approach to safety and responsibility” (DeepMind 2023). Google doesn’t say much about its security either, but it presumably has strong security, and DeepMind presumably benefits from that, so we probably greatly underestimate DeepMind’s security.
Evaluation
We give DeepMind a score of 6% on security.
We evaluate labs’ security based on the certifications they have earned, whether they say they use some specific best practices, and their track record. For more, including weighting between different criteria, see the Security page.
Certifications, audits, and pentests.
- Publish SOC 2, SOC 3, or ISO/IEC 27001 certification, including any corresponding report (redacting any sensitive details), for relevant products (3/4 credit for certification with no report). Score: 75%. Google has certification for relevant products including Vertex AI, but the reports aren’t public.
- Pentest. Publish pentest results (redacting sensitive details but not the overall evaluation). Scoring is holistic, based on pentest performance and the quality of the pentest. Score: No.
Specific best practices.
- Keep source code exclusively in a hardened cloud environment. Score: No.
- Use multiparty access controls for model weights and some code. Score: No.
- Limit uploads from clusters with model weights. Score: No.
Track record.
- Establish and publish a breach disclosure policy, ideally including incident or near-miss reporting. Also report all breaches since 1/1/2022 (and say the lab has done so). Score: No.
- … and track record: have few serious breaches and near misses. Evaluated holistically.
Commitments.
- Commit to achieve specific security levels (as measured by audits or security-techniques-implemented) before creating models beyond corresponding risk thresholds (especially as measured by model evals for dangerous capabilities). Score: No.
Alignment plan
Labs should make a plan for alignment, and they should publish it to elicit feedback, inform others’ plans and research (especially other labs and external alignment researchers who can support or complement their plan), and help them notice and respond to information when their plan needs to change. They should omit dangerous details if those exist. As their understanding of AI risk and safety techniques improves, they should update the plan. Sharing also enables outsiders to evaluate the lab’s attitudes on AI risk/safety. More.
What DeepMind is doing
Their alignment team has shared some of their thinking, concluding: “Our high-level approach is trying to direct the training process towards aligned AI and away from misaligned AI. There is a lot of alignment work going on at DeepMind, with particularly big bets on scalable oversight, mechanistic interpretability and capability evaluations.” The lab itself hasn’t explained any plan.
Evaluation
We give DeepMind a score of 38% on alignment plan. More.
- The safety team should share a plan for misalignment, including for the possibility that alignment is very difficult. Score: Yes. See Some high-level thoughts on the DeepMind alignment team’s strategy.
- … and the lab should have a plan, not just its safety team. Score: No.
- … and the lab’s plan should be sufficiently precise that it’s possible to tell whether the lab is working on it, whether it’s succeeding, and whether its assumptions have been falsified.
- … and the lab should share its thinking on how it will revise its plan and invite and publish external scrutiny of its plan.
Internal governance
Labs should have a governance structure and processes to promote safety and help make important decisions well. More.
What DeepMind is doing
DeepMind is part of Google, a for-profit company. We do not understand its relationship with Google; it may have slightly more independence than a normal subsidiary. In 2021, Google disallowed DeepMind from pursuing more independent governance.1 We believe Google could take Google DeepMind’s systems or replace its leadership.
See DeepMind’s About page, DeepMind’s old Operating Principles, Google’s AI Principles, and Google’s Responsible AI practices. Google has an “Advanced Technology Review Council,” but it does not seem to be set up to responsibly supervise the development or deployment of powerful AI systems. Google’s reports on their approach to responsible AI and page on AI governance don’t mention strong governance mechanisms for AI safety. See also “Introduction” and “Internal Governance and Operations” in 2022 AI Principles Progress Update (Google 2023).
Evaluation
We give DeepMind a score of 0% on internal governance.
For more, including weighting between different criteria, see the Internal governance page.
Organizational structure: the lab is structured in a way that enables it to prioritize safety in key decisions, legally and practically.
- The lab and its leadership have a mandate for safety and benefit-sharing and have no overriding legal duty to create shareholder value. Score: No. DeepMind is controlled by Google, a for-profit company.
- There is a board that can effectively oversee the lab, and it is independent:
- There is a board with ultimate formal power, and its main mandate is for safety and benefit-sharing. Score: No.
- … and it actually provides effective oversight.
- … and it is independent (i.e. its members have no connection to the company or profit incentive) (full credit for fully independent, no credit for half independent, partial credit for in between).
- … and the organization keeps it well-informed.
- … and it has formal powers related to risk assessment, training, or deployment decisions.
- Investors/shareholders have no formal power. Score: No.
Planning for pause. Have a plan for the details of what the lab would do if it needed to pause for safety and publish relevant details. In particular, explain what the lab’s capabilities researchers would work on during a pause, and say that the lab stays financially prepared for a one-year pause. Note that in this context “pause” can include pausing development of dangerous capabilities and internal deployment, not just public releases. Score: No.
Leadership incentives: the lab’s leadership is incentivized in a way that helps it prioritize safety in key decisions.
- The CEO has no equity (or other financial interest in the company). Score: No.
- Other executives have no equity. Score: No.
Ombuds/whistleblowing/trustworthiness:
- The lab has a reasonable process for staff to escalate concerns about safety. If this process involves an ombud, the ombud should transparently be independent or safety-focused. Evaluated holistically. Score: No.
- The lab promises that it does not use non-disparagement agreements (nor otherwise discourage current or past staff or board members from talking candidly about their impressions of and experiences with the lab). Score: No. But as far as we know it only uses non-solicitation agreements.
Alignment program
Labs should do and share alignment research as a public good, to help make powerful AI safer even if it’s developed by another lab. More.
What DeepMind is doing
DeepMind publishes substantial real alignment research. See AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work (Shah et al. 2024) and Some high-level thoughts on the DeepMind alignment team’s strategy (Krakovna and Shah 2023).
Evaluation
We give DeepMind a score of 100% on alignment program.
We simply check whether labs publish alignment research. (This is crude; legibly measuring the value of alignment research is hard.)
- Have an alignment research team and publish some alignment research. Score: Yes.
Public statements
Labs and their leadership should be aware of AI risk, that AI safety might be really hard, and that risks might be hard to notice. More.
What DeepMind is doing
Google and DeepMind as organizations seem to never talk about extreme risks or the alignment problem. DeepMind CEO Demis Hassabis and Chief AGI Scientist Shane Legg sometimes talk about extreme risks and the alignment problem,2 and they signed the CAIS letter, as did COO Lila Ibrahim. But DeepMind and its leadership miss lots of opportunities to talk about extreme risks or the alignment problem; for example, DeepMind’s Responsibility & Safety page doesn’t mention either. Google leadership seems to never talk about extreme risks or the alignment problem. DeepMind does research aimed at preventing extreme risks and solving the alignment problem; the rest of Google basically does not. DeepMind’s safety team has some plans for solving the alignment problem, but DeepMind as an organization does not.
Evaluation
We give DeepMind a score of 20% on public statements. More.
- The lab and its leadership understand extreme misuse or structural risks. Score: Yes.
- … and they understand misalignment, that AI safety might be really hard, that risks might be hard to notice, that powerful capabilities might appear suddenly, and why they might need an alignment plan, and they talk about all this. Score: No.
- … and they talk about it often/consistently.
- … and they consistently emphasize extreme risks.
- Clearly describe a worst-case plausible outcome from AI and state the lab’s credence in such an outcome. Score: No.
1. See WSJ 2021 and The Verge 2021.
2. On Hassabis, see e.g.:
On Legg, see e.g.: