Labs should make a plan for aligning powerful systems they create, and they should publish it to elicit feedback, inform others’ plans and research (especially other labs and external alignment researchers who can support or complement their plan), and help them notice and respond to information when their plan needs to change. They should omit dangerous details if those exist. As their understanding of AI risk and safety techniques improves, they should update the plan. Sharing also enables outsiders to evaluate the lab’s attitudes on AI risk/safety.
Frontier labs’ current alignment plans:
- OpenAI: their plan is to automate alignment research and bootstrap to aligning vastly superhuman systems.
- Anthropic: they have a portfolio approach and have discussed specific safety work they do. They don’t have a specific alignment plan, but they’re clearly doing lots of work aimed at the possibility that alignment is very difficult.
- DeepMind: their alignment team has shared some of their thinking, concluding: “Our high-level approach is trying to direct the training process towards aligned AI and away from misaligned AI. There is a lot of alignment work going on at DeepMind, with particularly big bets on scalable oversight, mechanistic interpretability and capability evaluations.” The lab itself hasn’t explained any plan.
- Other labs near the frontier have no plan.
What labs should do
Labs should make a plan for aligning powerful systems they create (and preventing harm from unaligned systems), and they should publish it to elicit feedback, inform others’ plans and research (especially other labs and external alignment researchers who can support or complement their plan), and help them notice and respond to information when their plan needs to change.1 They should omit dangerous details if those exist. As their understanding of AI risk and safety techniques improves, they should update the plan. Sharing also enables outsiders to evaluate the lab’s attitudes on AI risk/safety.
The main downside of sharing is that it incentivizes a lab to say whatever is popular or uncontroversial rather than what is good or what it believes. Beyond talking about a given plan in more popular/uncontroversial ways, pressure to share could cause a lab to choose and follow a plan because it is popular/uncontroversial. The actors and groups whose opinions labs care about may pay little attention to AI safety and may be unsophisticated about AI risk, so the plans that are most popular/uncontroversial among those groups are unlikely to be the best plans labs could devise.
Sharing also risks advancing others’ capabilities along dangerous paths, but that seems avoidable by redacting sensitive parts.
All of the above applies not just to a safety plan but also to intermediate safety goals or desiderata.
Labs should have a good process for improving their plan. Mostly I don’t know what that should be, but it helps if the plan makes falsifiable predictions.
Perhaps labs should have sharp triggers for actions and decisions. These could be in terms of model capabilities, other variables, or forecasts (labs could have an internal forecasting system or hire forecasters).
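For concreteness, here is a minimal sketch of what such triggers could look like as an explicit, checkable policy. Everything in it is a hypothetical illustration: the metric names, thresholds, and actions are invented, not drawn from any lab’s commitments.

```python
# Hypothetical sketch of "sharp triggers": explicit if-then commitments that tie
# measured variables (capability evals, forecasts) to predefined actions.
from dataclasses import dataclass


@dataclass
class Trigger:
    metric: str        # a capability eval score or internal forecast (hypothetical names below)
    threshold: float   # crossing this value activates the trigger
    action: str        # the response the lab commits to in advance


TRIGGERS = [
    Trigger("autonomous_replication_eval", 0.5, "pause scaling and escalate to the board"),
    Trigger("forecast_weights_stolen_within_1y", 0.1, "upgrade security before further training"),
]


def actions_due(measurements: dict[str, float]) -> list[str]:
    """Return the pre-committed actions whose trigger conditions are currently met."""
    return [t.action for t in TRIGGERS
            if measurements.get(t.metric, 0.0) >= t.threshold]


# Example: a hypothetical eval result crosses the first threshold.
print(actions_due({"autonomous_replication_eval": 0.62}))
# -> ['pause scaling and escalate to the board']
```

The point of writing triggers this explicitly is that anyone, inside or outside the lab, can check whether a trigger has fired and whether the committed action followed.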
Evaluation
We hope to eventually evaluate the content of labs’ alignment plans and perhaps whether the labs are working according to their plans. For now we do not evaluate a plan’s content. (A sketch of how the weighted criteria below combine into a score follows the list.)
- (3/8) The safety team should share a plan for misalignment, including for the possibility that alignment is very difficult. (Or, if it believes it’s too early to commit to any specific plan, it should explain why and share its thinking and what it’s working on.)
- (3/8) … and the lab should have a plan, not just its safety team.
- (1/8) … and the lab’s plan should be sufficiently precise that it’s possible to tell whether the lab is working on it, whether it’s succeeding, and whether its assumptions have been falsified.
- (1/8) … and the lab should share its thinking on how it will revise its plan and invite and publish external scrutiny of its plan.
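As a rough illustration of how these weights combine, here is a minimal sketch that assumes the “… and” criteria are cumulative, i.e. a later criterion only earns credit if every earlier one is also satisfied; the example flags are hypothetical, and this is not the official scoring rule.

```python
# Minimal sketch of cumulative weighted scoring (an assumption, not the official rule).
from fractions import Fraction

CRITERIA = [  # (weight, satisfied?) in the order listed above; flags are hypothetical
    (Fraction(3, 8), True),   # safety team shares a plan for misalignment
    (Fraction(3, 8), True),   # the lab itself, not just its safety team, has the plan
    (Fraction(1, 8), False),  # the plan is precise enough to be falsifiable
    (Fraction(1, 8), False),  # the lab explains how it will revise the plan and invites scrutiny
]

score = Fraction(0)
for weight, satisfied in CRITERIA:
    if not satisfied:
        break                 # cumulative: stop at the first unmet criterion
    score += weight

print(score)  # -> 3/4 for the hypothetical flags above
```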
Sources
Towards best practices in AGI safety and governance (Schuett et al. 2023). “Publish alignment strategy. AGI labs should publish their strategies for ensuring that their systems are safe and aligned.” 49% strongly agree; 41% somewhat agree; 3% neither agree nor disagree; 8% don’t know; n=39 (n=51 but 12 declined to answer).
A challenge for AGI organizations, and a challenge for readers (Bensinger and Yudkowsky 2022).
The current alignment plan, and how we might improve it (Shlegeris 2023).
What labs are doing
Microsoft: no plan.
Google DeepMind: no published official plan, but the DeepMind alignment team has shared some of their thinking. In particular, Some high-level thoughts on the DeepMind alignment team’s strategy (Krakovna and Shah 2023) presents their approach to alignment and how their alignment work fits into their plan. They conclude: “Our high-level approach is trying to direct the training process towards aligned AI and away from misaligned AI. There is a lot of alignment work going on at DeepMind, with particularly big bets on scalable oversight, mechanistic interpretability and capability evaluations.”
See also DeepMind is hiring for the Scalable Alignment and Alignment Teams (Shah and Irving 2022), DeepMind alignment team opinions on AGI ruin arguments (Krakovna 2022), Untitled [comment on DeepMind’s alignment research] (Shah 2022), AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work (Shah et al. 2024), and Rohin Shah on DeepMind and trying to fairly hear out both AI doomers and doubters (80,000 Hours 2023); these sources are unofficial and nonexhaustive. Additionally, the DeepMind safety team has done some plan-adjacent deconfusion research: Clarifying AI X-risk (2022) and Threat Model Literature Review (2022). See also the old Building safe artificial intelligence: specification, robustness, and assurance (DeepMind Safety Research 2018).
Meta AI: no plan.2 Meta AI talks about “Responsible AI,” which among other “pillars” includes “Robustness and safety,” but that discussion is not focused on misalignment-y threat models or on catastrophic-scale risks and includes no plan for them.3
OpenAI: they basically have a plan. The best statement of that plan is How weak-to-strong generalization fits into alignment in Weak-to-strong generalization (OpenAI 2023). Ideally the plan would be complete, concrete, and written in one place; as it stands, it may not be sharp enough to let observers notice if it’s misguided or off track.
We include some relevant quotes from OpenAI’s past writing on their plans.
Our approach to alignment research (OpenAI 2022):
At a high level, our approach to alignment research focuses on engineering a scalable training signal for very smart AI systems that is aligned with human intent. It has three main pillars:
- Training AI systems using human feedback
- Training AI systems to assist human evaluation
- Training AI systems to do alignment research
That post discusses some details of these three pillars.
Introducing Superalignment (OpenAI 2023):
Currently, we don’t have a solution for steering or controlling a potentially superintelligent AI, and preventing it from going rogue. Our current techniques for aligning AI, such as reinforcement learning from human feedback, rely on humans’ ability to supervise AI. But humans won’t be able to reliably supervise AI systems much smarter than us, and so our current alignment techniques will not scale to superintelligence. We need new scientific and technical breakthroughs.
Our approach
Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.
To align the first automated alignment researcher, we will need to 1) develop a scalable training method, 2) validate the resulting model, and 3) stress test our entire alignment pipeline:
- To provide a training signal on tasks that are difficult for humans to evaluate, we can leverage AI systems to assist evaluation of other AI systems (scalable oversight). In addition, we want to understand and control how our models generalize our oversight to tasks we can’t supervise (generalization).
- To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability).
- Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).
Planning for AGI and beyond (OpenAI 2023):
We will need to develop new alignment techniques as our models become more powerful (and tests to understand when our current techniques are failing). Our plan in the shorter term is to use AI to help humans evaluate the outputs of more complex models and monitor complex systems, and in the longer term to use AI to help us come up with new ideas for better alignment techniques.
Anthropic: Core Views on AI Safety (Anthropic 2023) gives background on their beliefs, presents their portfolio approach, and discusses specific safety work they do. It doesn’t contain a specific alignment plan, but they’re clearly doing lots of work aimed at the possibility that alignment is very difficult.
1. See A challenge for AGI organizations, and a challenge for readers (Bensinger and Yudkowsky 2022):
Having a plan is critically important for an AGI project, not because anyone should expect everything to play out as planned, but because plans force the project to concretely state their crucial assumptions in one place. This provides an opportunity to notice and address inconsistencies, and to notice updates to the plan (and fully propagate those updates to downstream beliefs, strategies, and policies) as new information comes in.
It’s also healthy for the field to be able to debate plans and think about the big picture, and for orgs to be in some sense “competing” to have the most sane and reasonable plan.
2. They have not published anything like an alignment plan. We believe they do not have a private plan (or a real alignment team).
3. Responsible AI (Meta AI) and Facebook’s five pillars of Responsible AI (Meta AI 2021). Another pillar is “Transparency and control,” but that refers to products like Facebook rather than transparency and control of AI models.