This post is incomplete. I'm publishing it because it might be helpful for some readers anyway. A good version of this post would be more detailed: for each proposed action, it would explain the motivation and high-level goal, explain the problem to solve, respond to obvious questions and objections, and lay out the costs involved and which costs are worth accepting.
These are the actions I think companies that will build very powerful AI systems should prioritize to improve safety. I think something like these actions would be cost-effective.
(This post is about responsible labs; asks for non-responsible labs should focus more on transparency and on becoming responsible. It is also about unilateral action; in coordinated action, labs can accept more costs.)
Security for model weights and code
Extremely strong security will be crucial when AIs can massively accelerate AI R&D. But operating under extreme security (security sufficient to defend against top-priority operations by top state actors) is costly and not yet necessary. So labs should focus on achieving extreme security optionality: the ability to implement extreme security quickly once it is needed.
Control
Anticipate and prevent the risk that AIs will scheme (i.e., pretend to be safe while subtly subverting safety measures to gain power) and escape during internal deployment.
- Use monitoring and auditing techniques in deployment (a minimal sketch of one such setup appears after this list)
- Run control evals against the deployment protocol: determine how well the protocol holds up assuming the model is scheming
- Try to directly test whether the model is scheming (even though this is unreliable): test the model on inputs that might cause it to misbehave in simulated high-stakes deployment settings, and test its behavior in simulations of a successful escape
- If the lab catches scheming, take it seriously: tell everyone; pause deployment, or at least go into high alert (disable some risky activities and lower the thresholds for editing, replacing, or auditing suspicious model outputs); and use that example of misbehavior to improve safety
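To make the monitoring and threshold ideas above concrete, here is a minimal sketch of a trusted-monitoring protocol, assuming a capable but untrusted model, a weaker trusted monitor that scores how suspicious each output is, and a human audit queue. All of the function names and threshold values are hypothetical stand-ins, not any lab's actual implementation.

```python
"""Minimal sketch of a control-style deployment protocol.

All model calls are stubs; a real setup would plug in an untrusted (powerful)
model, a trusted (weaker) monitor, and human auditors.
"""

from dataclasses import dataclass


@dataclass
class Thresholds:
    audit: float    # suspicion score above which a human must review the output
    replace: float  # suspicion score above which the output is replaced by a trusted model


NORMAL = Thresholds(audit=0.98, replace=0.90)
HIGH_ALERT = Thresholds(audit=0.90, replace=0.70)  # after catching scheming: audit/replace more


def untrusted_generate(task: str) -> str:
    return f"<powerful model's answer to: {task}>"  # stub for the capable, possibly scheming model


def trusted_suspicion(task: str, output: str) -> float:
    return 0.0  # stub: a weaker trusted model scores suspiciousness in [0, 1]


def trusted_generate(task: str) -> str:
    return f"<weaker trusted model's answer to: {task}>"


def queue_for_human_audit(task: str, output: str, score: float) -> None:
    print(f"AUDIT (score={score:.2f}): {task!r}")  # stub: send to human reviewers


def handle_task(task: str, t: Thresholds = NORMAL) -> str | None:
    output = untrusted_generate(task)
    score = trusted_suspicion(task, output)
    if score >= t.audit:
        queue_for_human_audit(task, output, score)
        return None                    # don't act on the output until a human has reviewed it
    if score >= t.replace:
        return trusted_generate(task)  # fall back to the trusted model for this task
    return output
```

A control eval in this setup would swap the untrusted model for a red-team model that actively tries to get a bad output past the monitor and measure how often it succeeds; switching to the high-alert thresholds corresponds to the "lower the thresholds" response after catching scheming.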
Misuse
Anticipate and prevent harm from misuse during external deployment.
- Do good model evals for dangerous capabilities: run evals pre-deployment (to inform deployment decisions), base them on good threat models and categories, use high-quality evals and good elicitation, publish the evals and results, and give third-party evaluators good access.
- Once models might have dangerous capabilities, deploy such that users can't access those capabilities: deploy via API or similar rather than releasing weights, monitor queries to block suspicious messages and identify suspicious users (a sketch of such a monitoring pipeline follows this list), and maybe do harmlessness/refusal training and work on adversarial robustness.
- If a particular deployment plan incurs a substantial risk of catastrophic misuse, deploy more narrowly or cautiously.
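As a rough illustration of the query-monitoring idea, here is a sketch of API-side misuse filtering: score each query, block the suspicious ones, and flag accounts that repeatedly trip the filter for human review. The classifier, thresholds, and helper functions are all assumptions made for the example.

```python
"""Minimal sketch of API-side misuse monitoring: block suspicious queries and
flag users who repeatedly trigger the filter. The scoring function and
thresholds are hypothetical stand-ins, not any lab's actual system."""

from collections import defaultdict

BLOCK_THRESHOLD = 0.8    # block individual queries scored above this
REVIEW_AFTER_FLAGS = 3   # send an account for human review after this many blocked queries

flag_counts: dict[str, int] = defaultdict(int)


def misuse_score(query: str) -> float:
    """Stub: a real system might score each query in [0, 1] using classifiers
    trained on the lab's misuse threat models (e.g., bio, cyber)."""
    return 0.0


def escalate_to_human_review(user_id: str) -> None:
    print(f"Review account {user_id}: repeated suspicious queries")  # stub


def serve_model_response(query: str) -> str:
    return f"<model response to: {query}>"  # stub: normal path, call the deployed model


def handle_query(user_id: str, query: str) -> str | None:
    score = misuse_score(query)
    if score >= BLOCK_THRESHOLD:
        flag_counts[user_id] += 1
        if flag_counts[user_id] >= REVIEW_AFTER_FLAGS:
            escalate_to_human_review(user_id)
        return None  # refuse this query
    return serve_model_response(query)
```

In a real system, the score would come from classifiers tied to the lab's misuse threat models rather than a stub, and blocked queries could feed back into evals and refusal training.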
Accountability for future deployment decisions
Plan so that, by the time AIs are capable of tripling the rate of R&D, the lab will do something like the following:
Do great risk assessment, especially via model evals for dangerous capabilities. Present risk assessment results and deployment plans to external auditors — perhaps make risk cases and safety cases for each planned deployment, based on threat modeling, model eval results, and planned safety measures. Give auditors more information upon request. Have auditors publish summaries of the situation, how good a job you're doing, the main issues, their uncertainties, and an overall risk estimate.
Plan for AGI
Labs should have a clear, good plan for what to do when they build AGI—i.e., AI that can obsolete top human scientists—beyond just making the AGI safe.
The outline of the best plan I've heard is: build human-obsoleting AIs that are sufficiently aligned/trustworthy that we can safely defer to them.
For now, the lab should develop and prepare to implement such a plan. It should also publicly articulate the basic plan, unless doing so would be costly or undermine the plan.
Safety research
Labs should do and share safety research as a public good, to help make powerful AI safer even if it's developed by another lab.
Policy engagement
Inform policymakers about AI capabilities; when there is better empirical evidence on misalignment and scheming risk, inform policymakers about that; don't get in the way of good policy.
Let staff share views
Encourage staff to discuss their views on AI progress, risks, and safety, both internally and publicly (except for confidential information), or at least avoid discouraging staff from doing so. (Facilitating external whistleblowing has benefits but some forms of it are costly, and the benefits are smaller for more responsible labs, so I don't confidently recommend unilateral action for responsible labs.)
Public statements
The lab and its leadership should understand and talk about extreme risks. The lab should clearly describe a worst-case plausible outcome from AI and state its credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).
If-then planning (meta)
Plan safety interventions as a function of dangerous capabilities, warning signs, etc. Rather than acting on guesses about which risks will appear, commit in advance to specific practices triggered by specific warning signs (a sketch of what such commitments could look like is below). (This should satisfy both risk-worried people, who expect warning signs to appear, and risk-unworried people, who don't.)
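To illustrate, here is a sketch of if-then commitments as a simple data structure: each entry pairs a concrete warning sign with the practices the lab commits to adopt if that sign appears. The specific triggers and responses are invented for illustration, not recommendations.

```python
"""Minimal sketch of if-then commitments: safety practices pinned in advance
to concrete warning signs. Triggers and responses below are illustrative only."""

IF_THEN_COMMITMENTS = [
    {
        "trigger": "model passes a bio-uplift dangerous-capability eval",
        "response": ["deploy behind an API only", "enable query monitoring for bio topics"],
    },
    {
        "trigger": "model can autonomously replicate in a sandboxed environment",
        "response": ["require control evals before internal deployment",
                     "raise security against well-resourced attackers"],
    },
    {
        "trigger": "model is caught attempting to subvert a safety measure",
        "response": ["pause frontier deployment", "notify external auditors",
                     "use the incident to improve monitoring"],
    },
]


def required_responses(observed_warning_signs: set[str]) -> list[str]:
    """Return every pre-committed response whose trigger has been observed."""
    return [action
            for commitment in IF_THEN_COMMITMENTS
            if commitment["trigger"] in observed_warning_signs
            for action in commitment["response"]]


if __name__ == "__main__":
    print(required_responses({"model passes a bio-uplift dangerous-capability eval"}))
    # -> ['deploy behind an API only', 'enable query monitoring for bio topics']
```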
More
I list some less important or more speculative actions (very nonexhaustive) in this footnote.
Almost none of this post is original. Thanks to Ryan Greenblatt for inspiring much of it, and thanks to Drake Thomas, Rohin Shah, Alexandra Bates, and Gabriel Mukobi for suggestions. They don't necessarily endorse this post.
Discuss on LessWrong.