AI Lab Watch

This page is not exhaustive, especially for companies besides OpenAI and Anthropic. Let me know of missing stuff using the button in the bottom-right.

This page is from an x-risk perspective but includes things not directly relevant to x-risk.

This page may include not just (central) integrity issues but also policy failures and statements that turned out to be very misleading.

Note that public communication is very good but can lead to integrity incidents.

Several companies

Policy advocacy that's more anti-regulation than their public statements would suggest and more anti-regulation in private than in public. Note that private advocacy rarely becomes public. See generally Companies' policy advocacy.
Gaming model benchmarks; maybe presenting new models/products as more powerful than they actually are, especially by making misleading comparisons between models
Not doing ambitious high-integrity stuff
- Not facilitating employees flagging false statements or violated processes
- Not committing to not use non-disparagement agreements, or at least not conceal non-disparagement agreements with non-disclosure agreements
- Not being generally transparent about policies
- Not being generally transparent about strength of security
  - Publicly reporting security incidents to make track record transparent; releasing redacted audit results; releasing redacted pentest results

OpenAI

See also The OpenAI Files.

Non-disparagement + non-disclosure agreements
- Whenever a staff member left OpenAI, OpenAI would ask them to sign a non-disparagement agreement to prevent them from criticizing OpenAI indefinitely.
- And OpenAI would ask them to sign a non-disclosure agreement to prevent them from revealing the existence of the non-disparagement agreement.
- Offering staff members a bonus for doing so would be bad, but OpenAI threatened staff members' vested equity.
- Moreover, OpenAI was deceptive in getting departing staff members to agree — most clearly, it told them they needed to sign within 7 days when they should have gotten 60 (and when challenged on this, OpenAI once audaciously told a departing staff member "The 7 days stated in the General Release supersedes the 60 day signature timeline").
- Plus OpenAI replied to this incident deceptively, on what OpenAI has done in the past (including on clawing back equity vs threatening to do so and suggesting that its control over staff equity was limited to exit documents, not Aestas incorporation documents), what Altman and the leadership knew, and what OpenAI will do in the future (including ambiguity about the various paths OpenAI had to claw back equity or prevent past staff members from selling it).
- Ultimately OpenAI said "We have not and never will take away vested equity, even when people didn't sign the departure documents. We're removing nondisparagement clauses from our standard departure paperwork, and we're releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual."
  - OpenAI has not been explicit about whether it might exclude people from tender events, force them to sell their equity, or otherwise treat them poorly — doing so would look very bad, but the ambiguity discourages criticism.
  - OpenAI does not seem to have released staff members from the silence about contract terms provision of the non-disclosure agreement, so past staff members likely believe they're disallowed from talk about their non-disclosure agreements.
- Sources
  - Vox: 1, 2
  - Kelsey Piper on X: 1, 2, 3, 4, 5
  - CNBC
- Separately, OpenAI seems to have a general culture where staff are discouraged from publicly criticizing it.
Superalignment compute
- OpenAI committed to give a certain amount of compute to its superintelligence alignment effort—and recruited based on this commitment—but denied the Superalignment team's requests for compute and failed to give it a schedule for receiving compute. Ultimately the team's leadership quit because they were "sailing against the wind" and the team was dissolved.
There have been two exoduses of safety staff from OpenAI, partially for integrity-related reasons; see 2024 OpenAI departures
Leopold Aschenbrenner firing
- OpenAI fired Aschenbrenner, claiming that he leaked information. Aschenbrenner alleges that the document he was told he was fired for leaking was "a brainstorming document . . . . shared . . . with three external researchers for feedback . . . . [without] anything sensitive," and doing this was normal, and when he asked OpenAI what was confidential its answer didn't make sense.[*] Additionally, Aschenbrenner alleges that when he shared a memo on OpenAI's security with the OpenAI board, "leadership was very unhappy" and he "got an official HR warning for sharing the memo with the board," and "when I was fired, it was very made explicit that the security memo was a major reason for my being fired. They said, 'the reason this is a firing and not a warning is because of the security memo.'"
Telling board members not to talk to employees
- Altman once told board member Tasha McCauley to tell him if she spoke to employees. McCauley was the board's designated staff liaison and the policy for OpenAI employees to raise concerns involved talking to her, a former OpenAI employee told me.
- (See also the Aschenbrenner allegation.)
Board crisis
- The board fired Altman for lying and being manipulative, both historically and more blatantly in trying to remove Helen Toner from the board.
  - Altman used Toner's coauthored paper Decoding Intentions as a pretext to try to remove her, then lied to other board members in an attempt to remove her (New Yorker, NYT).
  - Altman lied to the board and Murati about safety reviews.
- OpenAI executives told the board that Altman lies a lot.
- An OpenAI lawyer falsely told the board that it had a fiduciary duty to the company, according to Toner.
- Inadequate investigation
  - Investigation was narrow: limited to "the events concerning the November 17, 2023 removal of Sam Altman and Greg Brockman from the OpenAI Board of Directors and Mr. Altman’s termination as CEO."
  - The report was not published.
  - There was no mention of offering confidentiality or using an anonymous reporting mechanism.
- OpenAI took credit for March 2024 internal governance "enhancements" but failed to share details.
Preparedness Framework
- OpenAI takes credit for the PF ensuring safety, but the PF is weak/vague.
- OpenAI said "We think it's important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year." That was in February 2023; it did not elaborate (except to mention that it shared GPT-4 with METR; that was not an audit).
- OpenAI ignored commitments in its original PF, including audits, specialized fine-tuning, and apparently safety drills.
White House voluntary commitments
- Bug bounty: OpenAI committed to establish "bounty systems, contests, or prizes to incent the responsible disclosure of weaknesses, such as unsafe behaviors, or [] include AI systems in their existing bug bounty programs." Their bug bounty program excluded issues with models until 2025.
Whistleblowers allege that OpenAI illegally threatened employees to discourage whistleblowing, including that it "threatened employees with criminal prosecutions if they reported violations of law to federal authorities" and "made staff sign employee agreements that required them to waive their federal rights to whistleblower compensation."
Restructuring away from nonprofit control: lots of deception, including obfuscating the fact that the old situation was that the nonprofit controlled everything. E.g.
Profit cap non-transparency
- In OpenAI's first investment round, profits were capped at 100x. The cap for later investments is negotiated with the investor. (OpenAI LP (OpenAI 2019); archive of original.[*]) In 2021 Altman said the cap was "single digits now" (apparently referring to the cap for new investments, not just the remaining multiplier for first-round investors). In 2023, reportedly the cap was supposed to increase by 20% per year starting in 2025 (The Information 2023; The Economist 2023). Details are unclear now. OpenAI took credit for its capped-profit structure but has not discussed or acknowledged the changes.
Policy advocacy
- OpenAI leadership (Greg Brockman and Chris Lehane) is involved in a pro-AI advocacy network, Leading the Future, that will presumably be deceptive. Lehane and a16z have historically been deceptive in political advocacy. Leading the Future plans to emulate Fairshake and is led by one of the same people; Fairshake is low-integrity.
- OpenAI's direct policy advocacy is sometimes deceptive, such as on SB 53. See generally Companies' policy advocacy.
Internal messaging hack
- OpenAI had a major security incident and didn't report it, even to the government. It took over a year to come out; it seems that it could easily have never come out. This is evidence that there are things-that-look-bad that OpenAI has successfully kept secret.
OpenAI's original[*] "mission is to ensure that artificial general intelligence benefits all of humanity." Sometimes OpenAI and its leadership and staff state that that the mission is to build AGI themselves, including in formal announcements and legal documents.[*]
Some misc safetywashing
- Claiming rapid scaling improves safety by decreasing compute overhang while also trying to expand chip production
- Claiming GPT-4 is "aligned" (e.g. Altman)
Sam Altman personal stuff
- Loopt stuff
  - Most concretely: WSJ: "A group of senior employees at Altman’s first startup, Loopt—a location-based social-media network started in the flip-phone era—twice urged board members to fire him as CEO over what they described as deceptive and chaotic behavior, said people familiar with the matter."
- YC stuff
  - Double standard on conflicts of interest: "He caused tensions after barring other partners at Y Combinator from running their own [VC] funds," despite running one himself.
  - When Altman left YC, he published a blogpost on the YC site announcing that he was moving from president to chairman, but he just decided to do this; YC never made him chairman.
- AltC: in SEC filings, Altman incorrectly claimed to be YC chairman.
- Double standards, in particular Altman forced out board members over conflicts of interest while personally doing worse (most clearly personally investing in companies that then deal with OpenAI).

Microsoft

Deployment safety board
- Microsoft said "When it comes to frontier model deployment, Microsoft and OpenAI have together defined capability thresholds that act as a trigger to review models in advance of their first release or downstream deployment. The scope of a review, through our joint Microsoft-OpenAI Deployment Safety Board (DSB), includes model capability discovery. . . . We have exercised this review process with respect to several frontier models, including GPT-4." This sounds good, but Microsoft has not elaborated on these "capability thresholds," shared details about the DSB, or shared details about past reviews.
- Microsoft takes credit for the DSB without justifying that it's effective (and it seems ineffective)
- Microsoft deployed GPT-4 as a test without telling the DSB, and it initially denied this. See https://x.com/kevinroose/status/1798414599152431278.
Model evals & red-teaming
- Microsoft joined the White House voluntary commitments and takes credit for this. But it doesn't seem to be meeting its commitments on model evals and red-teaming, nor have a plan to meet them, and some of its writing on relevant safety practices is misleading.
  - Microsoft's commitments include "internal and external red-teaming of models or systems in areas including misuse, societal risks, and national security concerns, such as bio, cyber, [autonomous replication,] and other safety areas" and "publish[ing] reports for all new significant model public releases . . . . These reports should include the safety evaluations conducted (including in areas such as dangerous capabilities, to the extent that these are responsible to publicly disclose)" But it seems to do red-teaming just for undesired content, not dangerous capabilities.
Microsoft claimed its Bing Chat deployment as a red-teaming success, but it was in fact a dramatic failure. (This may be an issue of ignorance rather than integrity.)
Microsoft released the weights of its WizardLM-2 model in violation of its policy. (This is not directly an integrity issue, just a significant incident of safety policies being violated plus lack of a great response.)
Misc safetywashing
- Taking more credit for safety than it has earned, e.g. emphasizing unimportant things and not mentioning important things in its AI Safety Policies and Responsible AI Transparency Report.
- Suggesting that it is in compliance with the White House voluntary commitments when it is not (most clearly: it committed to do red-teaming for dangerous capabilities but seems to have merely been doing red-teaming for undesired content).

xAI

xAI's Risk Management Framework is terrible, and it's deceptive since it falsely claims to address risks.
- Misuse: xAI's criterion is very inadequate.
- Loss of control: xAI's plan is confused and its criterion is meaningless.
- Security: xAI claims to have good security but this is not at all credible.
xAI deployed a model violating its RMF (the same week it published the RMF). And it avoided drawing attention to this. And rather than changing the policy it made up some excuses for why the policy shouldn't apply.
RMF delay: xAI published the draft RMF in February. It said "We plan to release an updated version of this policy within three months." The RMF was published six months later.
System prompts: after system prompt mistakes, xAI said it would publish system prompts. It did to some extent but seems to not always update that page when it changes the system prompt.
Datacenter power: xAI apparently lied about datacenter power generation.

Anthropic

Suggesting it wouldn't push the frontier, pushing the frontier, and failing to clarify
- Anthropic once gave some people the impression that it planned to not release models that push the frontier of publicly available models. In particular, CEO Dario Amodei gave Dustin Moskovitz the impression that Anthropic committed "to not meaningfully advance the frontier with a launch" and gave Gwern a similar impression. Anthropic has not publicly clarified this commitment, even after its Claude 3 launch resulted in confusion about whether this commitment ever existed and what its current status is.
Non-disparagement
- Anthropic has offered severance agreements that include a non-disparagement clause and a non-disclosure clause that covers the non-disparagement clause. When this was made public, Anthropic replied that it "recognized that this routine use of non-disparagement agreements, even in these narrow cases, conflicts with [its] mission" and has recently "been going through [its] standard agreements and removing these terms." Moreover: "Anyone who has signed a non-disparagement agreement with Anthropic is free to state that fact (and we regret that some previous agreements were unclear on this point). If someone signed a non-disparagement agreement in the past and wants to raise concerns about safety at Anthropic, we welcome that feedback and will not enforce the non-disparagement agreement." This is good, but continuing to use non-disparagement in non-standard cases (as the reply leaves open) is bad, and Anthropic should "remov[e] these terms" retroactively.
  - Anthropic said "some previous agreements were unclear" on whether the non-disparagement clause was covered by non-disclosure, but this is a misleading understatement, as noted by Habryka. For example, Neel Nanda's severance agreement said "both Parties agree to keep the terms and existence of this agreement and the circumstances leading up to the termination of the Consultant's engagement and the completion of this agreement confidential"; this is quite clear.
Taking credit for the Long-Term Benefit Trust without justifying that it's good, and not explaining what's gone wrong with it
- The LTBT doesn't seem to be functioning well.[*]
- See also Maybe Anthropic's Long-Term Benefit Trust is powerless and Anthropic's Certificate of Incorporation.
A little safetywashing
- Slightly exaggerating interpretability research or causing observers to have excessively optimistic impressions of Anthropic's interpretability research
  - See e.g. Stephen Casper
When the Anthropic founders left OpenAI, they seem to have signed a non-disparagement agreement with OpenAI in exchange for OpenAI doing likewise. The details have not been published.
Anthropic seems to have said inconsistent things to investors vs safety people.
- The frontier-pushing thing
- A 2023 Anthropic pitch deck said "We believe that companies that train the best 2025/26 models will be too far ahead for anyone to catch up in subsequent cycles"; the implicit plan here is inconsistent with the vibe Anthropic gave safety people.

Google & Google DeepMind

They take credit for various safety councils and teams without justifying that those bodies are effective.

AI Lab Watch

Categories

Companies

Resources

Blog

About

Integrity incidents

Several companies

OpenAI

Microsoft

xAI

Anthropic

Meta

Google & Google DeepMind

See also