This post is incomplete. I'm publishing it because it might be helpful for some readers anyway. A good version of this post would be more detailed: for each proposed action, it would explain the motivation and high-level goal, explain the problem to solve, respond to obvious questions and objections, and lay out the costs involved and which costs are worth accepting.
These are the actions I think companies that will build very powerful AI systems should prioritize to improve safety. I think something like these actions would be cost-effective.
(This post is about responsible labs; asks for non-responsible labs should focus more on transparency and on becoming responsible. It is also about unilateral action; in coordinated action, labs can accept more costs.)
Security for model weights and code
Extremely strong security will be crucial when AIs can massively accelerate AI R&D. But operating under extreme security (security sufficient to defend against top-priority operations by top state actors) is costly and not yet necessary. So labs should focus on achieving extreme security optionality: the ability to implement extreme security quickly once it is needed.
Control
Anticipate and prevent the risk that AIs will scheme (i.e., pretend to be safe while subtly subverting safety measures to gain power) and escape during internal deployment.
- Use monitoring and auditing techniques in deployment (a minimal sketch of one such setup appears after this list)
- Run control evals against the deployment protocol: determine how well the protocol holds up assuming the model is scheming
- Try to directly test whether the model is scheming (even though this is unreliable): test the model on inputs that might cause it to misbehave in simulated high-stakes deployment settings, and test its behavior in simulations of a successful escape
- If the lab catches scheming, take it seriously: tell everyone; pause deployment, or at least go into high alert (disable some risky activities and lower the thresholds for editing, replacing, or auditing suspicious model outputs); and use that example of misbehavior to improve safety
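To make the monitoring and threshold ideas above concrete, here is a minimal sketch of a trusted-monitoring protocol, assuming a capable but untrusted model, a weaker trusted monitor that scores how suspicious each output is, and a human audit queue. All of the function names and threshold values are hypothetical stand-ins, not any lab's actual implementation.

```python
"""Minimal sketch of a control-style deployment protocol.

All model calls are stubs; a real setup would plug in an untrusted (powerful)
model, a trusted (weaker) monitor, and human auditors.
"""

from dataclasses import dataclass


@dataclass
class Thresholds:
    audit: float    # suspicion score above which a human must review the output
    replace: float  # suspicion score above which the output is replaced by a trusted model


NORMAL = Thresholds(audit=0.98, replace=0.90)
HIGH_ALERT = Thresholds(audit=0.90, replace=0.70)  # after catching scheming: audit/replace more


def untrusted_generate(task: str) -> str:
    return f"<powerful model's answer to: {task}>"  # stub for the capable, possibly scheming model


def trusted_suspicion(task: str, output: str) -> float:
    return 0.0  # stub: a weaker trusted model scores suspiciousness in [0, 1]


def trusted_generate(task: str) -> str:
    return f"<weaker trusted model's answer to: {task}>"


def queue_for_human_audit(task: str, output: str, score: float) -> None:
    print(f"AUDIT (score={score:.2f}): {task!r}")  # stub: send to human reviewers


def handle_task(task: str, t: Thresholds = NORMAL) -> str | None:
    output = untrusted_generate(task)
    score = trusted_suspicion(task, output)
    if score >= t.audit:
        queue_for_human_audit(task, output, score)
        return None                    # don't act on the output until a human has reviewed it
    if score >= t.replace:
        return trusted_generate(task)  # fall back to the trusted model for this task
    return output
```

A control eval in this setup would swap the untrusted model for a red-team model that actively tries to get a bad output past the monitor and measure how often it succeeds; switching to the high-alert thresholds corresponds to the "lower the thresholds" response after catching scheming.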
Misuse
Anticipate and prevent harm from misuse during external deployment.
- Do good model evals for dangerous capabilities: run evals pre-deployment (to inform deployment decisions), base them on good threat models and categories, use high-quality evals and good elicitation, publish the evals and results, and give third-party evaluators good access.
- Once models might have dangerous capabilities, deploy such that users can't access those capabilities: deploy via API or similar rather than releasing weights, monitor queries to block suspicious messages and identify suspicious users (a sketch of such a monitoring pipeline follows this list), and maybe do harmlessness/refusal training and work on adversarial robustness.
- If a particular deployment plan incurs a substantial risk of catastrophic misuse, deploy more narrowly or cautiously.
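As a rough illustration of the query-monitoring idea, here is a sketch of API-side misuse filtering: score each query, block the suspicious ones, and flag accounts that repeatedly trip the filter for human review. The classifier, thresholds, and helper functions are all assumptions made for the example.

```python
"""Minimal sketch of API-side misuse monitoring: block suspicious queries and
flag users who repeatedly trigger the filter. The scoring function and
thresholds are hypothetical stand-ins, not any lab's actual system."""

from collections import defaultdict

BLOCK_THRESHOLD = 0.8    # block individual queries scored above this
REVIEW_AFTER_FLAGS = 3   # send an account for human review after this many blocked queries

flag_counts: dict[str, int] = defaultdict(int)


def misuse_score(query: str) -> float:
    """Stub: a real system might score each query in [0, 1] using classifiers
    trained on the lab's misuse threat models (e.g., bio, cyber)."""
    return 0.0


def escalate_to_human_review(user_id: str) -> None:
    print(f"Review account {user_id}: repeated suspicious queries")  # stub


def serve_model_response(query: str) -> str:
    return f"<model response to: {query}>"  # stub: normal path, call the deployed model


def handle_query(user_id: str, query: str) -> str | None:
    score = misuse_score(query)
    if score >= BLOCK_THRESHOLD:
        flag_counts[user_id] += 1
        if flag_counts[user_id] >= REVIEW_AFTER_FLAGS:
            escalate_to_human_review(user_id)
        return None  # refuse this query
    return serve_model_response(query)
```

In a real system, the score would come from classifiers tied to the lab's misuse threat models rather than a stub, and blocked queries could feed back into evals and refusal training.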
Accountability for future deployment decisions
Plan so that, by the time AIs are capable of tripling the rate of R&D, the lab will do something like the following:
Do great risk assessment, especially via model evals for dangerous capabilities. Present risk assessment results and deployment plans to external auditors — perhaps make risk cases and safety cases for each planned deployment, based on threat modeling, model eval results, and planned safety measures. Give auditors more information upon request. Have auditors publish summaries of the situation, how good a job you're doing, the main issues, their uncertainties, and an overall risk estimate.
Plan for AGI
Labs should have a clear, good plan for what to do when they build AGI—i.e., AI that can obsolete top human scientists—beyond just making the AGI safe.
The outline of the best plan I've heard is: build human-obsoleting AIs that are sufficiently aligned/trustworthy that we can safely defer to them.
For now, the lab should develop and prepare to implement such a plan. It should also publicly articulate the basic plan, unless doing so would be costly or undermine the plan.
Safety research
Labs should do and share safety research as a public good, to help make powerful AI safer even if it's developed by another lab.
Policy engagement
Inform policymakers about AI capabilities; when there is better empirical evidence on misalignment and scheming risk, inform policymakers about that; don't get in the way of good policy.
Let staff share views
Encourage staff to discuss their views on AI progress, risks, and safety, both internally and publicly (except for confidential information), or at least avoid discouraging staff from doing so. (Facilitating external whistleblowing has benefits but some forms of it are costly, and the benefits are smaller for more responsible labs, so I don't confidently recommend unilateral action for responsible labs.)
Public statements
The lab and its leadership should understand and talk about extreme risks. The lab should clearly describe a worst-case plausible outcome from AI and state its credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).
If-then planning (meta)
Plan safety interventions as a function of dangerous capabilities, warning signs, etc. Rather than acting on guesses about which risks will appear, commit in advance to specific practices triggered by specific warning signs (a sketch of what such commitments could look like is below). (This should satisfy both risk-worried people, who expect warning signs to appear, and risk-unworried people, who don't.)
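To illustrate, here is a sketch of if-then commitments as a simple data structure: each entry pairs a concrete warning sign with the practices the lab commits to adopt if that sign appears. The specific triggers and responses are invented for illustration, not recommendations.

```python
"""Minimal sketch of if-then commitments: safety practices pinned in advance
to concrete warning signs. Triggers and responses below are illustrative only."""

IF_THEN_COMMITMENTS = [
    {
        "trigger": "model passes a bio-uplift dangerous-capability eval",
        "response": ["deploy behind an API only", "enable query monitoring for bio topics"],
    },
    {
        "trigger": "model can autonomously replicate in a sandboxed environment",
        "response": ["require control evals before internal deployment",
                     "raise security against well-resourced attackers"],
    },
    {
        "trigger": "model is caught attempting to subvert a safety measure",
        "response": ["pause frontier deployment", "notify external auditors",
                     "use the incident to improve monitoring"],
    },
]


def required_responses(observed_warning_signs: set[str]) -> list[str]:
    """Return every pre-committed response whose trigger has been observed."""
    return [action
            for commitment in IF_THEN_COMMITMENTS
            if commitment["trigger"] in observed_warning_signs
            for action in commitment["response"]]


if __name__ == "__main__":
    print(required_responses({"model passes a bio-uplift dangerous-capability eval"}))
    # -> ['deploy behind an API only', 'enable query monitoring for bio topics']
```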
More
I list some less important or more speculative actions (very nonexhaustive) in this footnote.
Almost none of this post is original. Thanks to Ryan Greenblatt for inspiring much of it, and thanks to Drake Thomas, Rohin Shah, Alexandra Bates, and Gabriel Mukobi for suggestions. They don't necessarily endorse this post.
Discuss on LessWrong.