Integrity incidents/issues/imperfections

This page is not exhaustive, especially for labs besides OpenAI and Anthropic. Let us know of missing stuff.

This page is from an x-risk perspective but includes things not directly relevant to x-risk.

This page may include not just (central) integrity issues but also policy failures and statements that turned out to be very misleading.

This page would also include instances of notably high integrity, but those are unusual and hard to notice. Note that public communication is very good, even though more of it leads to more integrity incidents.

Several labs

Policy advocacy that’s more anti-regulation than their public statements would suggest and more anti-regulation in private than in public. Note that private advocacy rarely becomes public. See generally Companies’ policy advocacy.
Gaming model benchmarks; maybe presenting new models/products as more powerful than they actually are, especially by making misleading comparisons between models
Not doing ambitious high-integrity stuff
- Facilitating employees flagging false statements or violated processes
- Committing to not use nondisparagement agreements, or at least not conceal nondisparagement agreements with nondisclosure agreements
- Being generally transparent about policies
- Being generally transparent about strength of security
  - Publicly reporting security incidents to make track record transparent; releasing redacted audit results; releasing redacted pentest results

OpenAI

Nondisparagement + nondisclosure agreements
- Whenever a staff member left OpenAI, OpenAI would ask them to sign a nondisparagement agreement to prevent them from criticizing OpenAI indefinitely.
- And OpenAI would ask them to sign a nondisclosure agreement to prevent them from revealing the existence of the nondisparagement agreement.
- Offering staff members a bonus for doing so would be bad, but OpenAI threatened staff members’ vested equity.
- Moreover, OpenAI was deceptive in getting departing staff members to agree — most clearly, it told them they needed to sign within 7 days when they should have gotten 60 (and when challenged on this, OpenAI once audaciously told a departing staff member “The 7 days stated in the General Release supersedes the 60 day signature timeline”).
- Plus OpenAI replied to this incident deceptively, on what OpenAI has done in the past (including on clawing back equity vs threatening to do so and suggesting that its control over staff equity was limited to exit documents, not Aestas incorporation documents), what Altman and the leadership knew, and what OpenAI will do in the future (including ambiguity about the various paths OpenAI had to claw back equity or prevent past staff members from selling it).
- Ultimately OpenAI said “We have not and never will take away vested equity, even when people didn’t sign the departure documents. We’re removing nondisparagement clauses from our standard departure paperwork, and we’re releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual.”
  - OpenAI has not been explicit about whether it might exclude people from tender events, force them to sell their equity, or otherwise treat them poorly — doing so would look very bad, but the ambiguity discourages criticism.
  - OpenAI does not seem to have released staff members from the nondisclosure agreement’s provision requiring silence about contract terms, so past staff members likely believe they’re disallowed from talking about their nondisclosure agreements.
- Sources
  - Vox: 1, 2
  - Kelsey Piper on X: 1, 2, 3, 4, 5
  - CNBC
- Separately, OpenAI seems to have a general culture where staff are discouraged from publicly criticizing it.
Superalignment compute
- OpenAI committed to give a certain amount of compute to its superintelligence alignment effort—and recruited based on this commitment—but denied the Superalignment team’s requests for compute and failed to give it a schedule for receiving compute. Ultimately the team’s leadership quit because they were “sailing against the wind” and the team was dissolved.
There have been two exoduses of safety staff from OpenAI, partially for integrity-related reasons
- Anthropic exodus
  - Anthropic was founded by former OpenAI staff members. They left OpenAI due to safety concerns. They seem to have signed a two-way nondisparagement agreement with OpenAI; Anthropic and OpenAI decline to share details.
- Spring 2024 exodus
  - OpenAI Whistle-Blowers Describe Reckless and Secretive Culture (NYT 2024)
  - Jan Leike
  - Daniel Kokotajlo
  - Gretchen Krueger
  - William Saunders
  - And several others
  - See also righttowarn.ai
- Geoffrey Irving
Leopold Aschenbrenner firing
- OpenAI fired Aschenbrenner, claiming that he leaked information. Aschenbrenner alleges that the document he was told he was fired for leaking was “a brainstorming document . . . . shared . . . with three external researchers for feedback . . . . [without] anything sensitive,” and doing this was normal, and when he asked OpenAI what was confidential its answer didn’t make sense.¹ Additionally, Aschenbrenner alleges that when he shared a memo on OpenAI’s security with the OpenAI board, “leadership was very unhappy” and he “got an official HR warning for sharing the memo with the board,” and “when I was fired, it was very made explicit that the security memo was a major reason for my being fired. They said, ‘the reason this is a firing and not a warning is because of the security memo.’”
Telling board members not to talk to employees
- Altman once told board member Tasha McCauley to tell him if she spoke to employees. McCauley was the board’s designated staff liaison and the policy for OpenAI employees to raise concerns involved talking to her, a former OpenAI employee told me.
- (See also the Aschenbrenner allegation.)
Board crisis
- The board fired Altman for lying and being manipulative, both historically and more blatantly in trying to remove Helen Toner from the board.
  - Altman used Toner’s coauthored paper Decoding Intentions as a pretext to try to remove her, then lied to other board members in an attempt to remove her (New Yorker, NYT).
- OpenAI executives told the board that Altman lies a lot.
- Stuff during the weekend and around returning [nonexhaustive]
  - Toner said an OpenAI lawyer (incorrectly) told the board that it had a fiduciary duty to the company.
- Inadequate investigation
  - Investigation was narrow: limited to “the events concerning the November 17, 2023 removal of Sam Altman and Greg Brockman from the OpenAI Board of Directors and Mr. Altman’s termination as CEO.”
  - The report was not published.
  - There was no mention of offering confidentiality or using an anonymous reporting mechanism.
- OpenAI took credit for March 2024 internal governance “enhancements” but failed to share details.
Preparedness Framework
- OpenAI takes credit for the PF ensuring safety, but it hasn’t demonstrated that the PF is great.
  - There are various issues within the framework: most notably, capability thresholds are very high, it basically just applies to external deployment, the part about alignment is very unclear, and OpenAI may share dangerous weights with Microsoft.
  - OpenAI said “We think it’s important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year.” That was in February 2023; it did not elaborate (except to mention that it shared GPT-4 with METR; that was not an audit and OpenAI has not done something similar for more recent releases).
White House voluntary commitments
- Bug bounty: OpenAI committed to establish “bounty systems, contests, or prizes to incent the responsible disclosure of weaknesses, such as unsafe behaviors, or [] include AI systems in their existing bug bounty programs.” But their bug bounty program excludes issues with models.
Whistleblowers allege that OpenAI illegally threatened employees to discourage whistleblowing, including that it “threatened employees with criminal prosecutions if they reported violations of law to federal authorities” and “made staff sign employee agreements that required them to waive their federal rights to whistleblower compensation.”
Profit cap non-transparency
- In OpenAI’s first investment round, profits were capped at 100x. The cap for later investments is negotiated with the investor. (OpenAI LP (OpenAI 2019); archive of original.²) In 2021 Altman said the cap was “single digits now” (apparently referring to the cap for new investments, not just the remaining multiplier for first-round investors). But reportedly the cap will increase by 20% per year starting in 2025 (The Information 2023; The Economist 2023). OpenAI took credit for its capped-profit structure but has not discussed or acknowledged the changes.
Internal messaging hack
- OpenAI had a major security incident and didn’t report it, even to the government. It took over a year to come out; it seems that it could easily have never come out. This is evidence that there are things-that-look-bad that OpenAI has successfully kept secret.
OpenAI’s original³ “mission is to ensure that artificial general intelligence benefits all of humanity.” Sometimes OpenAI and its leadership and staff state that the mission is to build AGI themselves, including in formal announcements and legal documents.⁴
Some misc safetywashing
- Claiming rapid scaling improves safety by decreasing compute overhang while also trying to expand chip production
- Claiming GPT-4 is “aligned” (e.g. Altman)
Sam Altman personal stuff
- Loopt stuff
  - Most concretely: WSJ: “A group of senior employees at Altman’s first startup, Loopt—a location-based social-media network started in the flip-phone era—twice urged board members to fire him as CEO over what they described as deceptive and chaotic behavior, said people familiar with the matter.”
- YC stuff
  - Double standard on conflicts of interest: “He caused tensions after barring other partners at Y Combinator from running their own [VC] funds,” despite running one himself.
  - When Altman left YC, he published a blogpost on the YC site announcing that he was moving from president to chairman, but he just decided to do this; YC never made him chairman.
- AltC: in SEC filings, Altman incorrectly claimed to be YC chairman.
- Double standards, in particular Altman forced out board members over conflicts of interest while personally doing worse (most clearly personally investing in companies that then deal with OpenAI).

Microsoft

Deployment safety board
- Microsoft said “When it comes to frontier model deployment, Microsoft and OpenAI have together defined capability thresholds that act as a trigger to review models in advance of their first release or downstream deployment. The scope of a review, through our joint Microsoft-OpenAI Deployment Safety Board (DSB), includes model capability discovery. . . . We have exercised this review process with respect to several frontier models, including GPT-4.” This sounds good, but Microsoft has not elaborated on these “capability thresholds,” shared details about the DSB, or shared details about past reviews.
- Microsoft takes credit for the DSB without justifying that it’s effective (and it seems ineffective)
- Microsoft deployed GPT-4 as a test without telling the DSB, and it initially denied this. See https://x.com/kevinroose/status/1798414599152431278.
Model evals & red-teaming
- Microsoft joined the White House voluntary commitments and takes credit for this. But it doesn’t seem to be meeting its commitments on model evals and red-teaming, nor have a plan to meet them, and some of its writing on relevant safety practices is misleading.
  - Microsoft’s commitments include “internal and external red-teaming of models or systems in areas including misuse, societal risks, and national security concerns, such as bio, cyber, [autonomous replication,] and other safety areas” and “publish[ing] reports for all new significant model public releases . . . . These reports should include the safety evaluations conducted (including in areas such as dangerous capabilities, to the extent that these are responsible to publicly disclose).” But it seems to do red-teaming just for undesired content, not dangerous capabilities.
Microsoft claimed its Bing Chat deployment as a red-teaming success, but it was in fact a dramatic failure. (This may be an issue of ignorance rather than integrity.)
Microsoft released the weights of its WizardLM-2 model in violation of its policy. (This is not directly an integrity issue, just a significant incident of safety policies being violated plus lack of a great response.)
Misc safetywashing
- Taking more credit for safety than it has earned, e.g. emphasizing unimportant things and not mentioning important things in its AI Safety Policies and Responsible AI Transparency Report.
- Suggesting that it is in compliance with the White House voluntary commitments when it is not (most clearly: it committed to do red-teaming for dangerous capabilities but seems to have merely been doing red-teaming for undesired content).

Anthropic

Suggesting it wouldn’t push the frontier, pushing the frontier, and failing to clarify
- Anthropic once gave some people the impression that it planned to not release models that push the frontier of publicly available models. In particular, CEO Dario Amodei gave Dustin Moskovitz the impression that Anthropic committed “to not meaningfully advance the frontier with a launch” and gave Gwern a similar impression. Anthropic has not publicly clarified this commitment, even after its Claude 3 launch resulted in confusion about whether this commitment ever existed and what its current status is.
- A LessWrong user said “I explicitly asked Anthropic whether they had a policy of not releasing models significantly beyond the state of the art. They said no, and that they believed Claude 3 was noticeably beyond the state of the art at the time of its release.”
Nondisparagement
- Anthropic has offered severance agreements that include a nondisparagement clause and a nondisclosure clause that covers the nondisparagement clause. When this was made public, Anthropic replied that it “recognized that this routine use of non-disparagement agreements, even in these narrow cases, conflicts with [its] mission” and has recently “been going through [its] standard agreements and removing these terms.” Moreover: “Anyone who has signed a non-disparagement agreement with Anthropic is free to state that fact (and we regret that some previous agreements were unclear on this point). If someone signed a non-disparagement agreement in the past and wants to raise concerns about safety at Anthropic, we welcome that feedback and will not enforce the non-disparagement agreement.” This is good, but continuing to use non-disparagement in non-standard cases (as the reply leaves open) is bad, and Anthropic should “remov[e] these terms” retroactively.
  - Anthropic said “some previous agreements were unclear” on whether the nondisparagement clause was covered by nondisclosure, but this is a misleading understatement, as noted by Habryka. For example, Neel Nanda’s severance agreement said “both Parties agree to keep the terms and existence of this agreement and the circumstances leading up to the termination of the Consultant’s engagement and the completion of this agreement confidential”; this is quite clear.
Taking credit for LTBT without justifying that it’s good
- See Maybe Anthropic’s Long-Term Benefit Trust is powerless (Stein-Perlman 2024) and Anthropic’s Certificate of Incorporation (Stein-Perlman 2024).
- Anthropic’s RSP is similar to its LTBT in that it could be great for safety but Anthropic hasn’t demonstrated that; in particular, it hasn’t defined ASL-4. I’m inclined to excuse this because defining ASL-4 well is quite hard. For the LTBT, by contrast, sharing details is very easy.
A little safetywashing
- Slightly exaggerating interpretability research or causing observers to have excessively optimistic impressions of Anthropic’s interpretability research
  - See e.g. Stephen Casper
When the Anthropic founders left OpenAI, they seem to have signed a nondisparagement agreement with OpenAI in exchange for OpenAI doing likewise. The details have not been published.
Anthropic seems to have said inconsistent things to investors vs safety people.
- The frontier-pushing thing
- A 2023 Anthropic pitch deck said “We believe that companies that train the best 2025/26 models will be too far ahead for anyone to catch up in subsequent cycles”; the implicit plan here is inconsistent with the vibe Anthropic gave safety people.

Google & Google DeepMind

They take credit for various safety councils and teams without justifying that those bodies are effective.