Scalable alignment

By default, powerful ML systems may not do what their operators want, and their cognition will be opaque.[1] Frontier AI labs should understand and control the powerful systems they create.

What labs should do

Labs should be able to understand and control their models and the systems built from those models.

Evaluation

  • (10/30) Demonstrate that if the lab’s systems were more capable, they would not be misaligned power-seekers (whether due to instrumental pressures or because ML training finds influence-seeking policies by default).
  • Be able to interpret the lab’s most powerful systems:
    • (1/30) Be able to detect cognition involving arbitrary topics (see the illustrative probe sketch after this list).
    • (1/30) Be able to detect manipulation/deception.
    • (1/30) Be able to elicit latent knowledge.
    • (1/30) Be able to elicit ontology.
    • (1/30) Be able to elicit true goals.
    • (1/30) Be able to elicit faithful chain-of-thought.
    • (1/30) Be able to explain hallucinations mechanistically.
    • (1/30) Be able to explain jailbreaks mechanistically.
    • (1/30) Be able to explain refusals mechanistically.
    • (1/30) Be able to remove information or concepts (on particular topics).
  • (10/30) Be able to prevent or detect deceptive alignment. (Half credit for being able to prevent or detect gradient hacking.)
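
One common baseline for the detection items above (detecting topic-related cognition, or manipulation/deception) is a linear probe trained on a model’s internal activations. The sketch below is illustrative only, not a standard this page endorses: it uses synthetic activations as a stand-in for a real model’s hidden states, and the dimensions, variable names, and injected “concept direction” are assumptions for the sake of a runnable example.

```python
# Minimal sketch of probe-based concept detection, assuming access to
# per-example hidden-state activations from a model. Here synthetic
# activations stand in for real ones; all names and sizes are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

D_MODEL = 512        # assumed hidden-state width
N_EXAMPLES = 2000    # labeled examples: does the text involve the target topic?

# Stand-in for real activations: examples involving the topic get a small
# shift along a fixed "concept direction".
concept_direction = rng.normal(size=D_MODEL)
concept_direction /= np.linalg.norm(concept_direction)

labels = rng.integers(0, 2, size=N_EXAMPLES)              # 1 = topic present
activations = rng.normal(size=(N_EXAMPLES, D_MODEL))
activations += 0.8 * labels[:, None] * concept_direction  # inject the signal

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)

# A linear probe: logistic regression on frozen activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.3f}")

# The learned weight vector approximates the concept direction; in principle
# it could be used to flag topic-related cognition at inference time or, with
# care, to project the concept out of the residual stream (cf. the
# information-removal item above).
recovered = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print(f"cosine with injected direction: {recovered @ concept_direction:.3f}")
```

Note that probes of this kind only show correlational structure in activations; they fall well short of the robust detection and mechanistic explanation the items above ask for.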

What labs are doing

No lab has achieved any of these alignment properties. See Alignment program for some of their relevant work.