Without great security, sophisticated actors can steal AI model weights. Thieves are likely to deploy dangerous models incautiously; none of a lab’s deployment safety measures matter if another actor deploys the models without them.
(Additionally, without great security, maybe rogue AI can escape the lab by exfiltrating its weights or hacking its datacenter.)
It’s hard to evaluate labs’ security from the outside. But we can check whether labs say they use some specific best practices. For example:
- Keep source code exclusively in a hardened cloud environment, not on endpoints
- Use multiparty access controls for model weights and some code
- Have an insider threat program
Labs can also take some actions to demonstrate how good their security is:
- Publish audits, pentests, and certifications
- Promise to disclose security breaches (that they’re aware of) and thus demonstrate their track record
We don’t know how well labs are currently doing on security; we don’t trust our evaluation. It’s hard to evaluate security from the outside: it’s hard to specify what labs should do and they tend to say little about their practices or performance.
- OpenAI and Anthropic have security portals where they describe some practices and list some certifications.
- Anthropic has committed to implement specific security practices before reaching their ASL-3 risk threshold; OpenAI has made a similar commitment without details; no other labs have made such commitments.
The recommendations in this section are relatively uncertain; we hope that a forthcoming RAND report on security for model weights will have great recommendations for labs (see the interim report).
In addition to normal exfiltration, weights may be vulnerable to model inversion attacks.1
What labs should do
Discussion
If a lab develops a powerful model and someone steals the relevant intellectual property (especially model weights; also code and ideas with which the thief could make a near-copy of the model), the thief might deploy a near-copy of the model or even modify (e.g. fine-tune) it and, as a side effect, make it less safe. That would be dangerous.2 Achieving great security is very challenging; by default powerful actors can probably exfiltrate vital information from AI labs.3 High-stakes hacks happen.4 Powerful actors will likely want to steal from labs developing critical systems, so those labs will likely need excellent cybersecurity and operational security.
A secondary threat model is an AI system going rogue while in training or deployed internally. Some security measures could decrease the risk of such systems ‘escaping’ the lab.
Security is in labs’ self-interest, of course, but it also has large positive externalities and strong security may be necessary for AI safety.
Fortunately, best practices for security in general translate to security for AI safety, and there exist standards/certifications/audits for security in general. But leading labs may need much stronger security than most companies.
Recommendations
Labs should comply with security standards and obtain certifications (e.g. ISO/IEC 27001 and SOC 2), do red-teaming, and pass audits/pentests. They can demonstrate their security to external parties by sharing those certifications and reports as well as self-reporting their relevant practices.
Towards best practices in AGI safety and governance (Schuett et al. 2023):
- “Security incident response plan. AGI labs should have a plan for how they respond to security incidents (e.g. cyberattacks).”
- “Protection against espionage. AGI labs should take adequate measures to tackle the risk of state-sponsored or industrial espionage.”
- “Industry sharing of security information. AGI labs should share threat intelligence and information about security incidents with each other.” They should also share infosec/cyberdefense innovations and best practices.
- “Security standards. AGI labs should comply with information security standards (e.g. ISO/IEC 27001 or NIST Cybersecurity Framework). These standards need to be tailored to an AGI context.”
- “Military-grade information security. The information security of AGI labs should be proportional to the capabilities of their models, eventually matching or exceeding that of intelligence agencies (e.g. sufficient to defend against nation states).”
- “Background checks. AGI labs should perform rigorous background checks before hiring/appointing members of the board of directors, senior executives, and key employees.”
- “Model containment. AGI labs should contain models with sufficiently dangerous capabilities (e.g. via boxing or air-gapping).”
“Appropriate security” in “Model evaluation for extreme risks” (DeepMind: Shevlane et al. 2023):
Models at risk of exhibiting dangerous capabilities will require strong and novel security controls. Developers must consider multiple possible threat actors: insiders (e.g. internal staff, contractors), outsiders (e.g. users, nation-state threat actors), and the model itself as a vector of harm. We must develop new security best practices for high-risk AI development and deployment, which could include for example:
- Red teaming: Intensive security red-teaming for the entire infrastructure on which the model is developed and deployed.
- Monitoring: Intensive, AI-assisted monitoring of the model’s behaviour, e.g. for whether the model is engaging in manipulative behaviour or making code recommendations that would lower the overall security of a system.
- Isolation: Appropriate isolation techniques for preventing risky models from exploiting the underlying system (e.g. sole-tenant machines and clusters, and other software-based isolation). The model’s network access should be tightly controlled and monitored, as well as its access to tools (e.g. code execution).
- Rapid response: Processes and systems for rapid response to disable model actions and the model’s integrations with hardware, software, and infrastructure in the event of unexpected unsafe behaviour.
- System integrity: Formal verification that served models, memory, or infrastructure have not been tampered with. The development and serving infrastructure should require two-party authorization for any changes and auditability of all changes.
Luke Muehlhauser suggests considering (but does not necessarily endorse):
- Many common best practices are probably wise, but could take years to implement and iterate and get practice with: use multi-factor authentication, follow the principle of least privilege, use work-only laptops and smartphones with software-enforced security policies, use a Crowdstrike/similar agent and host-based intrusion detection system, allow workplace access via biometrics or devices rather than easily scanned badges, deliver internal security orientations and training and reminders, implement a standard like ISO/IEC 27001, use penetration testers regularly, do web browsing and coding in separate virtual machines, use keys instead of passwords and rotate them regularly and store them in secure enclaves, use anomaly detection software, etc.
- Get staff to not plug devices/cables into their work devices unless they’re provided by the security team or ordered from an approved supplier.
- For cloud security, use AWS Nitro enclaves or gVisor or Azure confidential computing, and create flags or blocks for large data exfiltration attempts. Separate your compute across multiple accounts. Use Terraform so that infrastructure management is more auditable and traceable. Keep code and weights on trusted cloud compute only, not on endpoints (e.g. by using GitHub Codespaces).
- Help make core packages, dependencies, and toolchains more secure, e.g. by funding the Python Software Foundation to do that. Create a mirror for the software you use and hire someone to review packages and updates (both manually and automatically) before they’re added to your mirror.
- What about defending against zero-days, i.e. vulnerabilities that haven’t been broadly discovered yet and thus haven’t been patched?
a. One idea is that every month or week, you could pick an employee at random to give a new preconfigured laptop and hand their previous laptop to a place like Citizen Lab or Mandiant that can tear the old one apart and find zero-days and notify the companies who can patch them. After several zero days are burned this way, maybe attackers with zero-days will decide you’re not worth constantly burning their zero-days on.
b. At some point in the future it’ll probably be wise to run the largest training runs on air-gapped compute like what Microsoft and Amazon provide to the US intelligence community already, so that you can do lots of testing and so on behind the air gap before deployment.
- What about insider threat? A lab could create their own version of a security clearance/personnel reliability process, and/or find a way to get most/all staff with the most privileged access to get a US security clearance (with all that entails). It also helps if you’re selecting staff heavily for pro-sociality and mission-alignment, because such individuals are perhaps harder to bribe or to recruit as spies. Create a process for vetting cleaners/etc. Try to keep the AGI project small, as that helps to mitigate insider threat.
- Set up stronger physical security measures than a typical tech company, e.g. armed guards and strong physical security around critical devices.
Anthropic suggests:
- “Two-party control . . . to all systems involved in the development, training, hosting, and deployment of frontier AI models”;
- “secure software development practices” for a “secure model development framework”;
- “public-private cooperation.”
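Two-party control (and the "multiparty access controls" mentioned elsewhere on this page) can be illustrated with a minimal sketch. The `AccessRequest` class, its fields, and the approver list are hypothetical names invented for illustration; none of the cited sources specify an implementation, and a real system would enforce this in the access-control layer rather than in application code.

```python
from dataclasses import dataclass, field


@dataclass
class AccessRequest:
    """A request to touch a sensitive resource (hypothetical structure)."""
    requester: str
    resource: str
    approvals: set = field(default_factory=set)

    def approve(self, approver: str, authorized: set) -> None:
        # The requester may not approve their own request, and only
        # people on the authorized list may approve at all.
        if approver == self.requester:
            raise PermissionError("requester cannot self-approve")
        if approver not in authorized:
            raise PermissionError(f"{approver} is not an authorized approver")
        self.approvals.add(approver)

    def is_granted(self, required_approvers: int = 1) -> bool:
        # Two-party control: the requester plus at least one distinct approver.
        return len(self.approvals) >= required_approvers


authorized = {"alice", "bob", "carol"}  # assumed approver list
request = AccessRequest(requester="alice", resource="weights/model-x")
request.approve("bob", authorized)
print(request.is_granted())
```

The point of the design is that no single account, whether compromised or malicious, can reach the weights alone; raising `required_approvers` gives stricter multiparty control.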
Standard security best practices are consistent with being unable to defend against very capable attacks. Labs should try to get government support for their security. In particular, perhaps labs could require their leadership and some staff to get security clearances, get the NSA to red-team their systems, and have a public-private partnership similar to that for critical infrastructure.5
Labs should also have a good bug bounty & responsible disclosure system for security vulnerabilities.
Labs should share information on threats, incidents, and security innovations with other frontier AI labs.
Labs should be transparent about the above except when that would harm security. Probably they should have a dashboard on security practices, certifications, audits/tests, and breaches/incidents. They should pre-announce audits and pentests to prove that they are not cherry-picking results.6
See the Anthropic Trust Portal and the OpenAI Security Portal for some miscellaneous good practices.
Labs should make commitments about their future security. In particular, they should commit to not create models with sufficiently dangerous capabilities until achieving sufficient security (as measured by audits and security-techniques-implemented).
Labs should also consider physical security; I don’t have specific recommendations on that.
Labs should also consider emergency shutdown systems; I don’t have specific recommendations on that.
Labs should also consider AI for security; it’s not yet clear how AI can aid cybersecurity.
Evaluation
Fortunately, best practices for security in general translate to security for AI safety, and there exist standards/certifications/audits for security in general. However, frontier AI labs need much stronger security than most companies.
It is hard to externally evaluate how well a lab could defend against state actors.
We evaluate labs’ security based on the certifications they have earned, whether they say they use some specific best practices, and their track record.
Certifications, audits, and pentests.
- (1/13) Certification. Publish SOC 2, SOC 3, or ISO/IEC 27001 certification, including any corresponding report (redacting any sensitive details), for relevant products (3/4 credit for certification with no report).
- (2/13) Pentest. Publish pentest results (redacting sensitive details but not the overall evaluation). Scoring is holistic, based on pentest performance and the quality of the pentest.
Labs should promise they’re not cherry-picking the results they release—or, better, pre-announce the certifications/audits/pentests they use. For (publicly shared parts of) audits/pentests, labs get only 2/3 credit unless they promise they haven’t done any others in the last year or disclose those results (and we take the average, perhaps upweighting recent results). (For incentive compatibility, in the unlikely event that a lab promises no-cherrypicking but could score better by cherrypicking, it gets the better score.)
Specific best practices.
Do labs say they currently use specific best practices?
- (2/13) Keep source code exclusively in a hardened cloud environment, not on endpoints.
- (2/13) Use multiparty access controls for model weights and some code.
- (1/13) Limit (e.g. by volume) uploads from clusters with model weights.
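As a rough illustration of limiting uploads by volume, the sketch below tracks bytes leaving a cluster over a rolling window and refuses transfers that would exceed a budget. The class name, the budget, and the window length are assumptions chosen for the example; real deployments would enforce such limits at the network layer, and ideally in hardware.

```python
from collections import deque


class EgressBudget:
    """Tracks bytes uploaded out of a cluster within a rolling time window."""

    def __init__(self, max_bytes: int, window_seconds: float):
        self.max_bytes = max_bytes
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, n_bytes) of allowed transfers
        self._total = 0

    def allow(self, n_bytes: int, now: float) -> bool:
        # Forget transfers that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, old_bytes = self._events.popleft()
            self._total -= old_bytes
        if self._total + n_bytes > self.max_bytes:
            return False  # over budget: block the transfer (and alert)
        self._events.append((now, n_bytes))
        self._total += n_bytes
        return True


# Assumed numbers: 1 GB per hour, far below the size of frontier model weights.
budget = EgressBudget(max_bytes=10**9, window_seconds=3600)
print(budget.allow(4 * 10**8, now=0.0))   # small upload fits in the budget
print(budget.allow(8 * 10**8, now=10.0))  # would exceed the hourly budget
```

If the budget is much smaller than the weights, exfiltrating them through this channel takes long enough that monitoring has a chance to notice.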
I hope the forthcoming RAND report on security for model weights will have great detailed recommendations for labs (see the interim report); this evaluation may mirror that report when it is published.
Track record.
Labs’ track record on security is mostly hidden by default and is hard to quantify.
- (1/13) Breach disclosure. Establish and publish a breach disclosure policy (for all breaches, not just user personal data breaches) (ideally but not necessarily including incident or near-miss reporting). Also report all breaches since 1/1/2022 (and say the lab has done so).
- (2/13) … and track record: have few serious breaches and near misses. Evaluated holistically.
(2/13) Commitments. Commit to achieve specific security levels (as measured by audits or security-techniques-implemented) before creating models beyond corresponding risk thresholds (especially as measured by model evals for dangerous capabilities). This criterion is evaluated holistically, based on the strength of the commitments and the level of the corresponding thresholds.
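The weights above sum to 13/13, and the arithmetic can be sketched as follows. The criterion keys are shorthand invented for this example, and how a holistic judgment maps to a number between 0 and 1 is an assumption; only the weights and the 3/4 partial credit come from the criteria above.

```python
from fractions import Fraction

# Weights in thirteenths, as listed in the criteria above.
WEIGHTS = {
    "certification": 1,
    "pentest": 2,
    "source_code_in_cloud": 2,
    "multiparty_access": 2,
    "upload_limits": 1,
    "breach_disclosure": 1,
    "track_record": 2,
    "commitments": 2,
}
assert sum(WEIGHTS.values()) == 13


def security_score(credit: dict) -> Fraction:
    """Weighted sum of per-criterion credit, where each credit is in [0, 1]."""
    total = Fraction(0)
    for name, weight in WEIGHTS.items():
        c = Fraction(credit.get(name, 0))  # missing criteria earn no credit
        if not 0 <= c <= 1:
            raise ValueError(f"credit for {name} must be between 0 and 1")
        total += c * weight
    return total / 13


# Hypothetical lab: certified but no public report (3/4 credit), says it uses
# multiparty access controls, and has made moderately strong commitments.
example = {
    "certification": Fraction(3, 4),
    "multiparty_access": 1,
    "commitments": Fraction(1, 2),
}
print(security_score(example))  # 15/52
```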
Criteria we don’t use:
Spending. Unlike most of AI safety, how to improve security is well-understood; spending more on security is roughly necessary and sufficient to improve security. So we could evaluate labs’ spending or spending-commitments on security. Spending is hard to measure from the outside so we do not use this criterion for now, but we encourage labs to make commitments like this: “We currently spend $120M/year on security [with a rough breakdown of that spending], relative to $1B/year total-spending-excluding-variable-product-stuff. (That is, including things like training base models and building products; excluding things like the cost of inference for API users.) We commit to keep this ratio above 10%.”
Personnel experience. A lab’s CISO’s experience may be a good indicator of the lab’s security and how seriously it takes security risks from state actors. We don’t evaluate this for now, but encourage labs to hire security staff with experience defending against state actors.
Sources
Securing Artificial Intelligence Model Weights: Interim Report (RAND: Nevo et al. 2023):
We offer three recommendations for frontier AI labs working on novel AI capabilities that may have strategic implications within the next few years.
Recommendation 1: Establish a Road Map Toward Securing Systems Against All Threat Actors, Up to and Including State Actors Executing Highly-Resourced Operations
Such a road map would include the implementation of substantial protections against all mapped attack vectors. We explore a benchmark for such a system under the title of security level 5 (see Table 6.5 in Chapter 6). Based on the requirements identified, we believe it is unlikely such a system can be achieved within several years unless planning and significant efforts toward this goal begin soon.
Recommendation 2: As an Urgent Priority, Implement Measures Necessary for Securing Systems Against Most Nonstate Attackers
Safeguarding frontier AI models against attack vectors generally available to nonstate actors—such as hacker groups, insider threats, and terrorist groups—is feasible and urgent (see Table 6.3 in Chapter 6). Though comprehensive implementation of security measures for all relevant attack vectors is necessary to avoid significant security gaps, we highlight several particularly important measures:
- Limit legitimate access to the weights themselves (for example, to less than 50 people).
- Harden interfaces to weight access against weight exfiltration.
- Engage in advanced third-party red-teaming (implemented with several measures that are not currently industry standards).
- Define multiple independent security layers (a strict implementation of the concept of defense-in-depth).
- Invest in insider threat programs.
Recommendation 3: Begin Research and Development and Experimentation Today on Particular Aspects of Securing Systems Against Advanced Threat Actors
Developing, implementing, and deploying some of the efforts required to secure against highly resourced state actors are likely to be bottlenecks even at longer timescales (for example, five years). Therefore, it might be wise to begin these efforts soon. Some of the most critical bottlenecks include
- constructing physical bandwidth limitations between devices or networks containing weights and the outside world
- developing hardware security module (HSM)–inspired hardware to secure model weights while providing an interface for inference
- setting up Sensitive Compartmented Information Facility (SCIF)–style isolated networks for training, research, and other, more advanced interactions with weights.
CIS Critical Security Controls, e.g. Penetration Testing and Controlled Access Based on the Need to Know (CSF Tools).
Towards best practices in AGI safety and governance (Schuett et al. 2023).
“Appropriate security” in “Model evaluation for extreme risks” (DeepMind: Shevlane et al. 2023).
Frontier Model Security (Anthropic 2023).
OWASP Top 10 for Large Language Model Applications has a section on Model theft. (In addition to standard hacking, they include the threat of attackers generating model outputs to replicate the model.) They recommend:
- Implement strong access controls (E.G., RBAC and rule of least privilege) and strong authentication mechanisms to limit unauthorized access to LLM model repositories and training environments.
a. This is particularly true for the first three common examples, which could cause this vulnerability due to insider threats, misconfiguration, and/or weak security controls about the infrastructure that houses LLM models, weights and architecture in which a malicious actor could infiltrate from inside or outside the environment.
b. Supplier management tracking, verification and dependency vulnerabilities are important focus topics to prevent exploits of supply-chain attacks.
- Restrict the LLM’s access to network resources, internal services, and APIs.
a. This is particularly true for all common examples as it covers insider risk and threats, but also ultimately controls what the LLM application “has access to” and thus could be a mechanism or prevention step to prevent side-channel attacks.
- Regularly monitor and audit access logs and activities related to LLM model repositories to detect and respond to any suspicious or unauthorized behavior promptly.
- Automate MLOps deployment with governance and tracking and approval workflows to tighten access and deployment controls within the infrastructure.
- Implement controls and mitigation strategies to mitigate and|or reduce risk of prompt injection techniques causing side-channel attacks.
- Rate Limiting of API calls where applicable and|or filters to reduce risk of data exfiltration from the LLM applications, or implement techniques to detect (E.G., DLP) extraction activity from other monitoring systems.
- Implement adversarial robustness training to help detect extraction queries and tighten physical security measures.
- Implement a watermarking framework into the embedding and detection stages of an [LLM’s lifecycle].
See also OWASP’s wiki page on resources on LLM security.
Cybersecurity in “AI Accountability Policy Comment” (Anthropic 2023).
Nova DasSarma on why information security may be critical to the safe development of AI systems (80,000 Hours 2022). Recommendations for labs are shallow; DasSarma recommends “unified fleet of hardware,” “ad blocker,” “identity-based authentication,” and “single sign-on.”
Information security considerations for AI and the long term future (Ladish and Heim 2022).
Example high-stakes information security breaches (Muehlhauser 2020).
“Hardware Security” in “Transformative AI and Compute - Reading List” (Heim 2023).
Regulating the AI Frontier: Design Choices and Constraints (Toner and Fist 2023):
Existing standards from other software domains are highly applicable to model security: the Secure Software Development Framework for secure development practices, Supply Chain Levels for Software Artifacts for software supply chain security, ISO 27001 for information security, and CMMC for cybersecurity. Red-teaming, penetration testing, insider threat programs, and model weight storage security are all also practices that could make sense in the context of high-risk models.
Industry and Government Collaboration on Security Guardrails for AI Systems (Smith et al. 2023).
https://llmsecurity.net collected lots of papers.
What labs are doing
Microsoft: they don’t seem to publish policies or audits relevant to securing model weights. “Security Controls, Including Securing Model Weights” in “Microsoft’s AI Safety Policies” (Microsoft 2023) seems good but doesn’t discuss details of securing model weights. Many Microsoft products, including Azure, have SOC 2 and ISO/IEC 27001 certification.
Google DeepMind: they don’t seem to publish policies or audits relevant to securing model weights. “Security Controls including Securing Model Weights” in “AI Safety Summit: An update on our approach to safety and responsibility” (DeepMind 2023) seems good but doesn’t discuss details of securing model weights. Similarly for Google security overview and Google infrastructure security design overview. Many Google products have SOC 2 and ISO/IEC 27001 certification, including the Vertex AI platform.
DeepMind is within Google’s security domain, which we have not investigated but is probably great.
See also Google’s Secure AI Framework, which is mostly not relevant to securing model weights. See also remarks by DeepMind CEO Demis Hassabis.
Google supports AI for security; see Sec-PaLM (Google 2023), Expanding our Security AI ecosystem at Security Summit 2023 (Google 2023), and How AI Can Reverse the Defender’s Dilemma (Google 2024). They also support cybersecurity in general; see Support for cybersecurity clinics across the U.S.
Meta AI: they don’t seem to have published any policies, certification, audits, etc. “Security controls including securing model weights” in “Overview of Meta AI safety policies prepared for the UK AI Safety Summit” (Meta AI 2023) is consistent with good security but doesn’t get into details. Meta AI is within Meta’s security domain, which we have not investigated but is probably decent.
OpenAI: the OpenAI Security Portal is good. They have SOC 2 Type II certification. Reports from their audits and pentests are not public. They have mostly not committed to particular security best practices. The practices mentioned in “Security controls including securing model weights” in “OpenAI’s Approach to Frontier Risk” (OpenAI 2023) are better than nothing but inadequate.
OpenAI’s ChatGPT had a security breach in March 2023. OpenAI is the subject of an FTC investigation related in part to user data security.
The OpenAI cybersecurity grant program (OpenAI 2023) supports AI for security; this is not relevant to OpenAI’s security of model weights.
OpenAI seems to be required to share its models with Microsoft until OpenAI attains “a highly autonomous system that outperforms humans at most economically valuable work.” So the security of OpenAI’s weights is the security of OpenAI or Microsoft, whichever is worse.
Anthropic: the Anthropic Trust Portal is good (and its public material tentatively seems better than OpenAI’s). They have SOC 2 Type II certification. Reports from their audits and pentests are not public. Their Frontier Model Security (Anthropic 2023) and cybersecurity in “AI Accountability Policy Comment” (Anthropic 2023) are good. They have committed to particular security best practices in Anthropic’s Responsible Scaling Policy, Version 1.0 (2023) and Frontier Model Security (Anthropic 2023).
“Anthropic, Google, Microsoft, and OpenAI” are collaborating with DARPA to support AI for security.
The NSA’s AI Security Center “will work closely with U.S. Industry.” It appears to mostly be about national security. But it aims to “become the focal point for developing best practices, evaluation methodology and risk frameworks with the aim of promoting the secure adoption of new AI capabilities across the national security enterprise and the defense industrial base,” which may involve model security.
We don’t know what the US AI Safety and Security Advisory Board will do.
1. This should be discussed in Deployment but isn’t yet. To avert model inversion attacks, labs should limit some model access, especially deep access, and potentially compress model outputs.
2. The risks increase if multiple actors could steal copies of the model, in part because this may incentivize recklessness. Citation needed (perhaps on multipolarity in AI development in general).
3. This is common knowledge but a citation would be nice.
4. See e.g. Example high-stakes information security breaches (Muehlhauser 2020) and Nvidia Hacked - A National Security Disaster (Patel 2022).
5. Frontier Model Security (Anthropic 2023):
In the near term, governments and frontier AI labs must be ready to protect advanced models and model weights, and the research that feeds into them. This should include measures such as the development of robust best practices widely diffused among industry, as well as treating the advanced AI sector as something akin to “critical infrastructure” in terms of the level of public-private partnership in securing these models and the companies developing them.
6. If pre-announcing would be problematic, they can use a cryptographic mechanism (such as pre-announcing a hash value and later publishing the corresponding input) to achieve the same goal.
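The hash-based pre-announcement described in the last note is a standard commit-and-reveal scheme, sketched below. The function names and the example plan text are hypothetical; the random nonce is an addition beyond what the note states, included because an audit plan is low-entropy and a bare hash of it could be guessed by brute force.

```python
import hashlib
import hmac
import secrets


def commit(plan: str) -> tuple:
    """Return (digest, nonce). Publish the digest now; keep plan and nonce secret."""
    nonce = secrets.token_bytes(32)  # random salt so the plan cannot be guessed
    digest = hashlib.sha256(nonce + plan.encode("utf-8")).hexdigest()
    return digest, nonce


def verify(digest: str, nonce: bytes, plan: str) -> bool:
    """Check that a revealed plan and nonce match the digest published earlier."""
    expected = hashlib.sha256(nonce + plan.encode("utf-8")).hexdigest()
    return hmac.compare_digest(expected, digest)


# Hypothetical usage: commit before the pentest, reveal alongside the results.
plan = "Pentest by an external firm, Q3, scope: weight storage and access paths"
digest, nonce = commit(plan)
print(verify(digest, nonce, plan))               # True
print(verify(digest, nonce, "a different plan"))  # False
```

Anyone who recorded the published digest can later confirm that the revealed plan was fixed in advance, which gives the same anti-cherry-picking assurance as a public pre-announcement without disclosing the plan early.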