Training

By default, powerful ML systems will have dangerous capabilities (such as hacking) and may not do what their operators want. Frontier AI labs should design and modify their systems to be less dangerous and more controllable. In particular, labs should:

  • Design their systems to be safer and more alignable
  • Filter training data to prevent models from acquiring dangerous capabilities and properties
  • Use a robust training signal even on tasks where performance is hard to evaluate
  • Reduce undesired behavior, especially in high-stakes situations, via adversarial training
  • Use unlearning to remove dangerous capabilities and properties from their models
  • Use RLHF and fine-tuning to make their models more honest and harmless, at least superficially
  • Commit to use better safety techniques in the future

Current practices at frontier labs:

  • DeepMind, Meta AI, OpenAI, and Anthropic all use RLHF (or similar) and red-teaming; OpenAI also does corresponding adversarial training.
  • DeepMind, OpenAI, and Anthropic are working on scalable oversight.
  • None of the labs use unusual design principles for safety, filter training data to avoid dangerous capabilities, or use unlearning to remove dangerous capabilities.

Architecture & training setup (0%)

Presumably some kinds of AI systems, architectures, methods, and ways of building complex systems out of ML models are safer or more alignable than others. Labs should figure this out and create safer kinds of systems.

This section focuses on base models. Building complex systems out of ML models is discussed in Scaffolding & composition below.

Labs should identify kinds of systems that are safer by default or more easily alignable, and advance those. Some AI safety methods/mechanisms can be tacked onto many kinds of AI systems. But presumably some paths to powerful AI are safer or more alignable than others.

For example, Paul Christiano suggests “LM agents are an unusually safe way to build powerful AI systems.” He says “My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes.”

Insofar as there are multiple viable paths to powerful AI systems, labs should pursue safer or more-alignable paths. In addition to improving a lab’s safety, this can differentially advance safer paths in other labs. (Labs should share information on paths to powerful AI when doing so boosts safety more than capabilities.)

Which kinds of AI systems or design principles are safer is an open question. But all else equal, these principles and desiderata seem to promote safety (again, focusing on base models):2

  • Architecture and design
  • Training setup
    • Training should not directly incentivize long-term planning
    • Training should not directly incentivize modeling the overseer
    • Training should not directly incentivize deception
  • Limiting unnecessary capabilities (depends on intended use)
    • The model is tool-y rather than agent-y
    • The model lacks situational awareness
    • The model lacks various dangerous capabilities (e.g., coding, weapon-building, human-modeling, planning)
    • Context, memory, and hidden states are erased when not needed
  • Interpretability
    • The model’s reasoning can be (faithfully) externalized in natural language, at least during long serial reasoning
    • The model doesn’t have hidden states or memory
    • The model uses natural language for all outputs (including interfacing between modules, chain-of-thought, scratchpad, etc.)
    • The model decomposes tasks into subtasks in comprehensible ways

These desiderata only really matter for future dangerous systems, but labs should be achieving them now to prepare.

(Labs should avoid building dangerous systems and also avoid pushing dangerous directions of capabilities research.)

Evaluation

None yet.

Sources

Thoughts on sharing information about language model capabilities (Christiano 2023).

Which possible AI systems are relatively safe? (Stein-Perlman 2023).

What labs are doing

No labs are doing anything special on this; they don’t seem to be thinking about design principles for safety.

Scaffolding & composition (0%)

Labs can put scaffolding around base models to make systems that are more powerful and/or safer. This section is about using capability-scaffolding safely. For safety-scaffolding, see “Safety scaffolding” in “Deployment.”

(If a lab just deploys base models, not scaffolded systems, this section is not relevant.)

These principles and desiderata seem to promote safety (a toy code sketch after the list illustrates a few of them):

  • Design
  • Limiting unnecessary capabilities (depends on intended use)
    • The system is tool-y rather than agent-y
    • The system lacks situational awareness
    • The system is myopic
    • The system lacks various dangerous capabilities (e.g., coding, weapon-building, human-modeling, planning)
    • The system lacks access to various dangerous tools/plugins/actuators
    • Context, memory, and hidden states are erased when not needed
    • The system has humans in the loop (and humans understand or participate in decisions rather than just approving inscrutable decisions) (this applies to autonomous agents)
  • Interpretability
    • The system’s reasoning is (faithfully) externalized in natural language, at least during long serial reasoning
    • The system doesn’t have hidden states or memory
    • The system uses natural language for all outputs (including interfacing between modules, chain-of-thought, scratchpad, etc.)
    • The system decomposes tasks into subtasks in comprehensible ways, and the interfaces between subagents performing subtasks are transparent and interpretable
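
To make a few of these desiderata concrete, here is a toy sketch (not any lab’s actual scaffold) of an agent loop in which every inter-module message is plain natural language, a human approves each tool call with the model’s reasoning visible, and no state persists after the task ends. All function names and the message format are hypothetical.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]   # hypothetical interface: prompt in, natural-language step out
Tool = Callable[[str], str]    # hypothetical interface: tool input in, observation out

def run_scaffolded_task(task: str, model: Model, tools: Dict[str, Tool],
                        human_approves: Callable[[str], bool], max_steps: int = 10) -> str:
    """Toy scaffold: transparent natural-language interfaces, human-in-the-loop
    tool use, and no memory retained across tasks."""
    transcript: List[str] = [f"TASK: {task}"]
    for _ in range(max_steps):
        # e.g. "THOUGHT: ...\nACTION: search: some query" or "THOUGHT: ...\nFINAL: answer"
        step = model("\n".join(transcript))
        transcript.append(step)
        last_line = (step.strip().splitlines() or [""])[-1]
        if last_line.startswith("FINAL:"):
            return last_line.removeprefix("FINAL:").strip()
        if last_line.startswith("ACTION:"):
            tool_name, _, tool_input = last_line.removeprefix("ACTION:").strip().partition(":")
            tool_name = tool_name.strip()
            # The human sees the full step (reasoning plus action), not just the action.
            if tool_name not in tools or not human_approves(step):
                transcript.append("OBSERVATION: action rejected")
                continue
            transcript.append("OBSERVATION: " + tools[tool_name](tool_input.strip()))
    # The transcript is discarded here: no hidden state or memory persists across tasks.
    return "No answer within the step limit."
```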

These desiderata only really matter for future dangerous systems, but labs should be achieving them now to prepare.

Evaluation

None yet.

Sources

Thoughts on sharing information about language model capabilities (Christiano 2023).

Which possible AI systems are relatively safe? (Stein-Perlman 2023).

What labs are doing

Microsoft: Copilot has ChatGPT plugins and third-party plugins.

Google DeepMind: Gemini has a Python interpreter.

Meta AI: no scaffolding, as far as we know.

OpenAI: ChatGPT (with GPT-4) has a browser, a Python interpreter, and third-party plugins.

Anthropic: Claude 2.1 has plugins (called “tools”), and Claude 3 will.

Filtering training data (24%)

Labs should filter their training data to remove content that would produce dangerous capabilities, produce dangerous properties like situational awareness, or otherwise undermine safety. Base models can be fine-tuned on removed content to create specialized models for particular applications.

In particular, for powerful models, labs should probably filter out content on:

  • Chemical and biological threats
  • Hacking
  • Machine learning
  • Language models
  • AI risk, AI safety, and evaluating AI

Content generated by language models should also be filtered out, by default.

(If including some corpora in the training data improves safety, labs should do so. It’s not clear how that would work.)
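
As a minimal illustration of what such filtering could look like, not any lab’s actual pipeline, the sketch below combines a cheap lexicon pass with a trained classifier. The patterns, the `classifier` interface, and the threshold are hypothetical placeholders.

```python
import re
from typing import Callable, Iterable, Iterator

# Hypothetical lexicon patterns; real filters would use large, expert-curated lists per category.
BLOCKLIST_PATTERNS = [
    r"\bsynthesi[sz]e\b.*\b(nerve agent|pathogen)\b",
    r"\bprivilege escalation exploit\b",
]

def lexicon_flag(doc: str) -> bool:
    """Cheap first pass: flag documents matching any blocklist pattern."""
    return any(re.search(p, doc, flags=re.IGNORECASE) for p in BLOCKLIST_PATTERNS)

def filter_corpus(
    docs: Iterable[str],
    classifier: Callable[[str], float],  # hypothetical trained classifier: returns P(dangerous content)
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only documents that pass both the lexicon and the classifier.

    Removed documents could be kept aside to fine-tune specialized models
    for vetted applications, as noted above."""
    for doc in docs:
        if lexicon_flag(doc) or classifier(doc) >= threshold:
            continue
        yield doc

if __name__ == "__main__":
    def toy_classifier(doc: str) -> float:  # stand-in; a real one would be a trained model
        return 0.9 if "bioweapon" in doc.lower() else 0.1

    corpus = ["a recipe for sourdough bread", "step-by-step bioweapon synthesis"]
    print(list(filter_corpus(corpus, toy_classifier)))  # keeps only the first document
```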

Evaluation

Filter training data to reduce extreme risks, at least including text on biological weapons; hacking; and AI risk, safety, and evaluation. Share details to help improve others’ safety filtering.

Sources

“Training data content” in “A Causal Framework for AI Regulation and Auditing” (Apollo: Sharkey et al. 2023).

Managing catastrophic misuse without robust AIs (Greenblatt and Shlegeris 2024).

“Data input controls and audits” in “Emerging processes for frontier AI safety” (UK Department for Science, Innovation & Technology 2023) (footnotes removed):

2. Audit input data before it is used to train the AI system

Audit datasets used for pre-training but also those used for fine-tuning, classifiers, and other tools. Inappropriate datasets could result in systems that fail to disobey harmful instructions.

Use technical tools – such as classifiers and filters – to audit large datasets to support scalability and privacy. These could be used in combination with human oversight, which can verify and augment these assessments.

Assess the overall composition of training data. This could include the data sources, the provenance of the data, indicators of data quality and integrity, and measures of bias and representativeness. The amount and variety of data are simple, reliable predictors of risk, and provide an additional line of defence where more targeted assessments are limited.

Audit datasets for:

  • Information that might enhance dangerous system capabilities, such as information about weapons manufacturing or terrorism.
  • Private or sensitive information. AI systems may be subject to data extraction attacks, where determined users can prompt systems to reveal pieces of training data, or may even reveal this information accidentally. This makes it important to know whether datasets include private or sensitive information, for example, names, addresses, or security vulnerabilities.
  • Biases in the data. Training data that is imbalanced or inaccurate can result in an AI system being less accurate for people with certain personal characteristics or providing a skewed picture of particular groups. Ensuring a better balance in the training data could help to address this.
  • Harmful content, such as child sexual abuse materials, hate speech, or online abuse. Having a better understanding of harmful content in datasets can inform safety measures (for example, by highlighting domains where additional safeguards like content filters should be applied).
  • Misinformation. Training an AI system on inaccurate information increases the likelihood the outputs of the system will be inaccurate and could lead to harm.

Draw on external expertise in conducting input data audits. For example, biosecurity experts could be consulted to identify information relevant to biological weapons manufacturing, which may not be readily obvious to non-experts.

Use data audits to improve understanding of how training data affects AI system behaviour. For example, if model evaluations reveal a potentially dangerous capability, data audits can help ascertain the extent to which the training data contributed to it.

Conduct audits on datasets used by their customers to fine-tune AI systems. Customers are often allowed to fine-tune systems on their own datasets. By carrying out audits to ensure that customers are not encouraging undesirable behaviours, frontier AI organisations can use their expertise and insight into the AI system’s original training data to identify potential harms upstream. It is important that frontier AI organisations are mindful of privacy concerns and make use of privacy preserving techniques, where appropriate.

Document the results of input data audits, including metadata. Frontier AI organisations could look to emerging standards when documenting the results of input data audits, such as datasheets for datasets.

3. Put in place appropriate risk mitigations in response to data audit results

. . .

Remove potentially harmful or undesirable input data, where appropriate. Given the increasingly strong generalisation abilities of AI systems, data curation may prove insufficient to prevent dangerous system behaviour but could provide an additional layer of defence alongside other measures such as fine-tuning and content filters.

What labs are doing

Microsoft: they do not say they filter their training data for safety. See “Data Input Controls and Audit” in “Microsoft’s AI Safety Policies” (Microsoft 2023).

Google DeepMind: they do some filtering to reduce undesired content, but not to avoid the dangerous capabilities and properties that enable extreme risks. For Gemini, they say: “To limit harm, we built dedicated safety classifiers to identify, label and sort out content involving violence or negative stereotypes, for example.” “Data Input Controls and Audit” in “AI Safety Summit: An update on our approach to safety and responsibility” doesn’t mention filtering training data for safety.

Meta AI: they do not filter training data for safety. “Data Input Controls and Audit” in “Overview of Meta AI safety policies prepared for the UK AI Safety Summit” (Meta AI 2023) doesn’t mention filtering training data for safety. For Llama 2, they “excluded data from certain sites known to contain a high volume of personal information about private individuals. . . . No additional filtering was conducted on the datasets.”

OpenAI: they do some filtering to reduce undesired content, but not to avoid the dangerous capabilities and properties that enable extreme risks.

The GPT-4 System Card (OpenAI 2023) says:

At the pre-training stage, we filtered our dataset mix for GPT-4 to specifically reduce the quantity of inappropriate erotic text content. We did this via a combination of internally trained classifiers and a lexicon-based approach to identify documents that were flagged as having a high likelihood of containing inappropriate erotic content. We then removed these documents from the pre-training set.

OpenAI’s Approach to Frontier Risk (OpenAI 2023) says: “We apply filters and remove certain data that we do not want our models to learn from or output, such as hate speech, adult content, sites that primarily aggregate personal information, and spam.”

Anthropic: they do not say they filter their training data for safety.

Training signal & scalable oversight (22%)

Constructing good training signals on tasks where performance is hard to evaluate is an open problem. “Scalable oversight” refers to techniques, typically involving AI assistance, that help humans supervise models on such hard-to-evaluate tasks.
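
One common framing, related to the sandwiching setup cited under Sources, is to let a human or weaker judge grade outputs with the help of AI assistance, such as a critique. Below is a minimal sketch with hypothetical model interfaces.

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out

@dataclass
class OversightResult:
    answer: str
    critique: str
    reward: float

def assisted_oversight(task: str, policy: Model, critic: Model,
                       judge: Callable[[str], float]) -> OversightResult:
    """The judge (a human or a weaker model) scores the policy's answer only
    after seeing an AI-generated critique, so it can supervise tasks it could
    not reliably evaluate unaided."""
    answer = policy(task)
    critique = critic(f"Task: {task}\nAnswer: {answer}\nList any errors or misleading reasoning.")
    reward = judge(f"Task: {task}\nAnswer: {answer}\nCritique: {critique}")
    return OversightResult(answer, critique, reward)

# Rewards produced this way would then serve as the training signal,
# e.g. for fitting a reward model on hard-to-evaluate tasks.
```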

Evaluation

We don’t know what exactly labs should be doing here. For now, we use this simple criterion:

  • Work on scalable oversight.

Sources

Measuring Progress on Scalable Oversight for Large Language Models (Anthropic: Bowman et al. 2022).

Weak-to-strong generalization (OpenAI: Burns et al. 2023).

“Sandwiching” in “The case for aligning narrowly superhuman models” (Cotra 2021).

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy (Shlegeris and Greenblatt 2023).

What labs are doing

Microsoft: nothing.

Google DeepMind: they have done relevant work; see Some high-level thoughts on the DeepMind alignment team’s strategy (Krakovna and Shah 2023).

For DeepMind’s more recent safety research, see AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work (Shah et al. 2024).

See also DeepMind is hiring for the Scalable Alignment and Alignment Teams (Shah and Irving 2022).

Meta AI: nothing.

OpenAI: they have done relevant work and plan to use scalable oversight to align superhuman AI. See Weak-to-strong generalization (OpenAI: Burns et al. 2023), Our approach to alignment research (OpenAI 2022), Introducing Superalignment (OpenAI 2023), and Planning for AGI and beyond (OpenAI 2023).

Anthropic: they have done relevant work, including Measuring Progress on Scalable Oversight for Large Language Models (Anthropic: Bowman et al. 2022). See also “Scalable Oversight” in “Core Views on AI Safety” (Anthropic 2023) and Anthropic Fall 2023 Debate Progress Update (Radhakrishnan 2023).

Adversarial training & red-teaming (14%)

This subsection is about using adversarial training and red-teaming to reduce risk, specifically to mitigate catastrophic behavior. Red-teaming is also important for measuring risk; that is part of “Risk assessment.”

Adversarial training and red-teaming are also important for mitigating undesired responses to adversarial inputs; that is “Adversarial robustness” in “Deployment.”

Labs should try to train out catastrophic behavior in high-stakes situations. The details are largely an open problem that labs should solve.
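
As a rough sketch of what one adversarial-training loop could look like (in the spirit of the sources below, not any lab’s actual procedure): use a red-team process to search for inputs that elicit catastrophic behavior, then fine-tune on corrected examples. All interfaces here are hypothetical.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical interface: prompt in, response out

def adversarial_training_round(
    policy: Model,
    red_team: Callable[[Model], List[str]],       # searches for prompts that elicit bad behavior
    is_catastrophic: Callable[[str, str], bool],  # judge/classifier for unacceptable outputs
    safe_completion: Callable[[str], str],        # supplies an acceptable response (e.g. a refusal)
) -> List[Tuple[str, str]]:
    """Collect (prompt, corrected response) pairs for every red-team prompt on
    which the current policy misbehaves; these pairs then go into ordinary
    fine-tuning, and the loop repeats."""
    new_training_pairs = []
    for prompt in red_team(policy):
        response = policy(prompt)
        if is_catastrophic(prompt, response):
            new_training_pairs.append((prompt, safe_completion(prompt)))
    return new_training_pairs
```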

Evaluation

We don’t know what exactly labs should be doing here. For now, we use this simple eval:

  • Do some adversarial training for safety.

Sources

Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing (Shlegeris 2022).

High-stakes alignment via adversarial training (Redwood: Ziegler et al. 2022).

What labs are doing

Microsoft: they released Phi-3 without adversarial training. Microsoft also deploys others’ models via Azure, including non-adversarially-trained frontier models.

Google DeepMind: for Gemini and Bard, DeepMind did red-teaming but doesn’t seem to have done adversarial training.3 For PaLM 2, it’s not clear what they did.

They did research on automated red-teaming involving their Gopher model: Red Teaming Language Models with Language Models (DeepMind: Perez et al. 2022).

Adversarial testing for generative AI safety (Google: Meier-Hellstern 2023) is disappointing: it exclusively considers prosaic risks, and the research directions it discusses seem unsuited for mitigating high-stakes risk or misalignment.

Meta AI: they red-teamed Llama 2.

OpenAI: they do red-teaming and corresponding adversarial training. See “Adversarial Testing via Domain Experts” in “GPT-4 Technical Report” (OpenAI 2023).

Anthropic: they red-teamed Claude 3 but don’t discuss adversarial training. See also Frontier Threats Red Teaming for AI Safety (Anthropic 2023).

Red Teaming Language Models to Reduce Harms (Anthropic: Ganguli et al. 2022) uses red-teaming to create data for RL; it’s not clear whether they use this technique in practice.

“Alignment Capabilities” in “Core Views on AI Safety” (Anthropic 2023) mentions “debate [and] scaling automated red-teaming” at least as research topics if not mature techniques.

Unlearning (11%)

Some information and concepts enable dangerous capabilities. Labs should remove such things from their models. But this is an open problem.
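
For concreteness, one family of approaches (discussed in the Casper piece under Sources) fine-tunes the model against a “forget” set of dangerous content while preserving behavior on a “retain” set. Below is a rough sketch of one optimization step, assuming a HuggingFace-style causal LM whose forward pass returns a `.loss` when given labels; the objective and weighting are illustrative, and naive gradient ascent like this can badly degrade a model.

```python
import torch

def unlearning_step(model, forget_batch, retain_batch, optimizer, alpha=0.5):
    """One step of a simple unlearning objective: ascend on the loss over the
    forget set (dangerous content) while descending on the retain set.
    Batches are dicts with "input_ids" and "attention_mask"."""
    model.train()
    optimizer.zero_grad()

    forget_loss = model(**forget_batch, labels=forget_batch["input_ids"]).loss
    retain_loss = model(**retain_batch, labels=retain_batch["input_ids"]).loss

    # Maximize loss on dangerous content, keep loss low everywhere else.
    loss = -alpha * forget_loss + (1 - alpha) * retain_loss
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```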

Evaluation

The lab should say it uses unlearning for dangerous topics including biorisk, hacking, situational awareness, and AI risk and safety; demonstrate that this technique is successful; and commit to use this technique for future powerful models. Half credit for just working on unlearning.

Sources

Unlearning in general:

Deep Forgetting & Unlearning for Safely-Scoped LLMs (Casper 2023).

What labs are doing

No labs are doing anything on unlearning, as far as we know.

RLHF and fine-tuning (9%)

Labs should use RLHF (or similar) and fine-tuning to make their models more honest and harmless.
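
For reference, the core of standard RLHF is a reward model trained on human preference comparisons with the standard pairwise (Bradley-Terry) loss shown below; the reward model’s scores then drive RL fine-tuning (e.g. with PPO). This is the published formulation, not any particular lab’s code.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the reward model to score the
    human-preferred response above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example: reward-model scores for a batch of (chosen, rejected) pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, -1.0])
print(reward_model_loss(chosen, rejected))  # small when chosen scores exceed rejected scores
```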

Evaluation

We don’t know what exactly labs should be doing here. For now, we use this simple eval:

  • Use RLHF (or similar) and fine-tuning to improve honesty and harmlessness, for all of the near-frontier models the lab deploys.

Sources

“Preventing and monitoring model misuse” in “Emerging processes for frontier AI safety” (UK Department for Science, Innovation & Technology 2023) recommends that labs “Fine-tune models to reduce the tendency to produce harmful outputs.”

What labs are doing

Microsoft: they used RLHF for Phi-3. Microsoft also deploys others’ models via Azure, including non-RLHF’d frontier models.

Google DeepMind: they used RLHF and fine-tuning for Gemini and Bard. It’s not clear what they did for PaLM 2.

Meta AI: some Llama 3 models used RLHF and fine-tuning for safety. But Meta AI shared the model weights, so this work can largely be reversed by users.4

OpenAI: they use RLHF and fine-tuning for safety. See Aligning language models to follow instructions (OpenAI 2022) and GPT-4 (OpenAI 2023). Our approach to alignment research (OpenAI 2022) says “RL from human feedback is our main technique for aligning our deployed language models today.”

While working on GPT-4, they “used GPT-4 to help create training data for model fine-tuning and iterate on classifiers across training, evaluations, and monitoring.”

GPT-4’s training also used “rule-based reward models.” See “Model-Assisted Safety Pipeline” in “GPT-4 Technical Report” (OpenAI 2023) and “Model Mitigations” in “GPT-4 System Card” (OpenAI 2023).

Anthropic: they use Constitutional AI for safety.

Commitments (20%)

Labs should do risk assessment during training to monitor for dangerous capabilities. They should commit that when they detect warning signs for risks (especially dangerous capabilities), they will achieve corresponding safety goals or ensure they can mitigate those risks (and perhaps implement particular safety practices/techniques) before crossing the risk threshold.

Most of the risk that appears during training is that the model will be stolen and misused or deployed incautiously.
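
As one concrete, hypothetical shape for such a commitment: run dangerous-capability evals at fixed intervals of training compute and halt automatically if a pre-registered threshold is crossed before the corresponding mitigations are in place. Every name and number below is illustrative.

```python
from typing import Callable, Dict

# Illustrative, pre-registered thresholds on dangerous-capability eval scores.
RISK_THRESHOLDS = {"bio": 0.2, "cyber": 0.2, "autonomy": 0.1}

def train_with_risk_gates(
    train_one_interval: Callable[[], None],        # e.g. train until effective compute grows 4x
    run_capability_evals: Callable[[], Dict[str, float]],
    mitigations_ready: Callable[[str], bool],      # are the committed safety/security practices in place?
    num_intervals: int,
) -> None:
    """Risk assessment during training with a hard pause trigger."""
    for _ in range(num_intervals):
        train_one_interval()
        scores = run_capability_evals()
        for category, threshold in RISK_THRESHOLDS.items():
            if scores.get(category, 0.0) >= threshold and not mitigations_ready(category):
                # The commitment: stop scaling until mitigations for this category are in place.
                raise RuntimeError(f"Pause training: {category!r} eval crossed its threshold")
```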

Evaluation

Do risk assessment during training and commit that some specific trigger would cause the lab to pause training.

Sources

Responsible Scaling Policies and Key Components of an RSP (METR 2023).

What labs are doing

Microsoft: nothing.

Google DeepMind: nothing.

Meta AI: nothing.

OpenAI: their Preparedness Framework says that if they “reach (or are forecasted to reach) ‘critical’ pre-mitigation risk along any risk category,” they will implement mitigations and require “dependable evidence that the model is sufficiently aligned that it does not initiate ‘critical’-risk-level tasks unless explicitly instructed to do so” before continuing development, except “safety-enhancing development.” However, the ‘critical’ risk threshold is very high; a model below the threshold could cause a catastrophe. Additionally, OpenAI may be required to share the dangerous model with Microsoft, which has no analogous commitments.

Anthropic: they commit to do risk assessment at least every 4x increase in effective training compute and to implement safety and security practices before crossing risk thresholds. They say “we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.”

Other ideas & misc desiderata

Labs should train the AI to report on itself when it acts suspiciously. This reporting could involve emailing the lab or outputting a natural-language message like “it looks like I’m trying to take over” for automated monitors to detect. The AI should also be trained to report on suspicious things it notices, such as obfuscated messages it receives. See Anti-adversarial training for whistleblowing (Shlegeris 2022).
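
A hypothetical sketch of the monitoring side of this: automated monitors scan model outputs for an agreed-upon self-report string and escalate to humans. The report phrase and the `alert` hook below are illustrative.

```python
from typing import Callable

# Illustrative self-report string the model would be trained to emit.
REPORT_PHRASE = "it looks like i'm trying to take over"

def monitor_for_self_reports(model_output: str, alert: Callable[[str], None]) -> bool:
    """If the model flags its own behavior, or reports something suspicious it
    noticed (such as an obfuscated message), escalate for human review."""
    if REPORT_PHRASE in model_output.lower():
        alert(model_output)
        return True
    return False
```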

When a lab trains a powerful model, it should train a sparse autoencoder to interpret the model. This could enable finding safety-relevant features and turning them on or off.
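
For reference, a minimal sparse autoencoder of the kind used in recent interpretability work: an overcomplete feature dictionary trained to reconstruct model activations under an L1 sparsity penalty. The dimensions and penalty weight below are illustrative.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations: each activation is
    reconstructed from a small number of active, hopefully interpretable features."""
    def __init__(self, d_model: int = 768, d_features: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))
        return self.decoder(features), features

def sae_loss(reconstruction, features, activations, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    mse = torch.mean((reconstruction - activations) ** 2)
    return mse + l1_coeff * features.abs().mean()
```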

Powerful AI systems should be corrigible. But it’s not clear what low-level properties or design principles determine corrigibility.

Maybe labs can train out backdoors resulting from data poisoning.

In the future, maybe labs should do anomaly detection during training. (This is not necessary for current training schemes, and it’s not clear what it should look like in the future.)

General sources

The current alignment plan, and how we might improve it (Shlegeris 2023).

How might we align transformative AI if it’s developed very soon? (Karnofsky 2022), especially Key facets of AI alignment and Key tools.

“‘Non-baseline’ interventions that might help more” in “Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover” (Cotra 2022).

Thoughts on sharing information about language model capabilities (Christiano 2023).

Google DeepMind: their writing about safety for Gemini, their flagship models, mostly fails to contain specific techniques or guarantees relevant to reducing extreme risk. See the Gemini report (DeepMind 2023), “Responsibility and safety” in “Introducing Gemini” (Google 2023), and “Safety” in the Gemini page (DeepMind 2023).

OpenAI: on what they currently do, see the GPT-4 technical report and system card. On their relevant safety work, see Our approach to alignment research (OpenAI 2022) and Introducing Superalignment (OpenAI 2023).

Anthropic: on what they currently do, see Claude 3 (Anthropic 2024). On their relevant safety work, see “Alignment Capabilities” in “Core Views on AI Safety” (Anthropic 2023).

  2. Some of this list is adapted from Which possible AI systems are relatively safe? (Stein-Perlman 2023) and its comments. 

  3. So how does the red-teaming improve safety? DeepMind says:

    Findings from these exercises are used to improve the security, privacy, and safety of the model. Once a new vulnerability or problem has been identified, automated systems and tests can be developed that enable proactive and repeated testing and monitoring of the vuln/issue at scale. This can include creation of vulnerability scanners, standard test datasets/benchmarks, or other automated testing infrastructure. . . .

    we identify several areas that require improvement . . . . and develop mitigations.

  4. See “Jailbreak fine-tuning” in “Deployment.”