Labs should do research to make AI systems safer, more interpretable, and more controllable, and they should publish that research.
- DeepMind, OpenAI, and Anthropic publish substantial alignment research
- Other labs near the frontier publish basically no alignment research (since they don’t have alignment programs or employ alignment researchers)
What labs should do
Labs should do and share alignment research[1] as a public good, to help make powerful AI safer even if it’s developed by another lab.
Labs should also try to build systems that automate alignment research, tools for alignment and oversight, and other systems that improve AI safety.
Labs should also share information to advance safer or more alignable kinds of systems. In particular, if they find that a kind of system is relatively safe, they should inform other labs and potentially share capabilities research on that path.[2]
Alignment research is often advantageous IP, so a lab that shares its alignment research is often making a sacrifice for the common good.[3]
Labs need not do all of this research in-house: it would suffice to support external alignment research, via funding and (if they have private frontier models) by sharing deep access to powerful models.
Alignment research sometimes advances dangerous capabilities; to decide whether to share research, labs should do cost-benefit analysis.[4] In cases where the alignment research is dual-use, they could share it within a trusted group of other labs and perhaps independent alignment researchers and alignment organizations. Sharing with such a group gets much of the value of publishing the research—especially if all leading labs are among the safety-conscious group—with less downside than publishing, but doesn’t help safety researchers outside the group. It is harder for us to evaluate privately-shared research.
We don’t have concrete recommendations for how labs can do good alignment research. Of course, labs that don’t try to do alignment research should start, and some that do should let their alignment teams hire more researchers.
Evaluation
When a lab produces alignment research, it can publish it, share it privately, or not share it. We are not aware of a good way to evaluate privately-shared or unshared research, so we do not attempt it. Conveniently, we do not believe any labs share alignment research privately.
In the future, we hope to use a more sophisticated evaluation, such as having someone survey (non-lab-affiliated) alignment researchers to elicit their views on each lab’s published alignment research. We may also interview one or more senior alignment researchers. For now, we use a simple binary criterion:
- Have an alignment research team and publish some alignment research.
Sources
The Alignment Problem from a Deep Learning Perspective (Ngo et al. 2024).
What labs are doing
Microsoft: Microsoft doesn’t do alignment research.
Google DeepMind: DeepMind publishes substantial real alignment research. From Some high-level thoughts on the DeepMind alignment team’s strategy (Krakovna and Shah 2023):
Scalable alignment (led by Geoffrey Irving):
- Sparrow
- Process-based feedback
- Red-teaming
Alignment (led by Rohin Shah):
- Capability evaluations (led by Mary Phuong, in collaboration with other labs)
- Mechanistic interpretability (led by Vladimir Mikulik)
- Goal misgeneralization (led by Rohin Shah)
- Causal alignment (led by Tom Everitt)
  - Paper: Discovering Agents
For DeepMind’s more recent safety research, see AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work (Shah et al. 2024).
Meta AI: Meta AI doesn’t really do alignment research. Its “Responsible AI” Research Area includes two publications (as of April 2024), of which the Purple Llama project seems relevant to extreme risks. Its “Integrity” Research Area seems less relevant to safety, but includes UNIREX, which seems good.
OpenAI: OpenAI publishes substantial real alignment research. See their Safety & Alignment research. In particular, real alignment research includes:
- Weak-to-strong generalization (Burns et al. 2023)
- Language models can explain neurons in language models (Bills et al. 2023)
- AI-written critiques help humans notice flaws (Saunders et al. 2022)
- Aligning language models to follow instructions (Ouyang et al. 2022).
OpenAI has also supported external alignment research, especially via its Superalignment Fast Grants.
Anthropic: Anthropic publishes substantial real alignment research. See its Research. In particular, real alignment research includes:
- Measuring Progress on Scalable Oversight for Large Language Models (Anthropic: Bowman et al. 2022)
- A General Language Assistant as a Laboratory for Alignment (Anthropic: Askell et al. 2021) and Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Anthropic: Bai et al. 2022)
- Towards Monosemanticity (Anthropic: Bricken et al. 2023)
- Anthropic Fall 2023 Debate Progress Update (Radhakrishnan 2023)
- Measuring Faithfulness in Chain-of-Thought Reasoning (Anthropic: Lanham et al. 2023) and Question Decomposition Improves the Faithfulness of Model-Generated Reasoning (Anthropic: Radhakrishnan et al. 2023).
1. In this context, “alignment research” includes not just work on aligning powerful AI systems but also work on determining whether systems are aligned and on getting useful work out of potentially-unaligned systems at lower risk.
2. See e.g. Thoughts on sharing information about language model capabilities (Christiano 2023).
3. Or at least it might naively appear so—but the safety benefits of sharing are generally great enough that the humans at a lab are better off sharing their alignment research.
4. Labs should also take into account the asymmetry that information can be kept private now and shared later but cannot be unshared, as well as the possibility of later learning more about whether research is safe to share. They should also take into account the unilateralist’s curse.