Integrity incidents/issues/imperfections

This page is not exhaustive, especially for labs besides OpenAI and Anthropic. Let us know of missing stuff.

This page is from an x-risk perspective but includes things not directly relevant to x-risk.

This page may include not just (central) integrity issues but also policy failures and statements that turned out to be very misleading.

This page would also include instances of notably high integrity, but those are unusual and hard to notice. Note that public communication is very good, even though it tends to lead to integrity incidents.

Several labs

  • Policy advocacy that’s more anti-regulation than their public statements would suggest and more anti-regulation in private than in public. Note that private advocacy rarely becomes public. See generally Companies’ policy advocacy.
  • Gaming model benchmarks; maybe presenting new models/products as more powerful than they actually are, especially by making misleading comparisons between models
  • Not doing ambitious high-integrity stuff
    • Facilitating employees flagging false statements or violated processes
    • Committing to not use nondisparagement agreements, or at least not conceal nondisparagement agreements with nondisclosure agreements
    • Being generally transparent about policies
    • Being generally transparent about strength of security
      • Publicly reporting security incidents to make track record transparent; releasing redacted audit results; releasing redacted pentest results

OpenAI

  • Nondisparagement + nondisclosure agreements
    • Whenever a staff member left OpenAI, OpenAI would ask them to sign a nondisparagement agreement to prevent them from criticizing OpenAI indefinitely.
    • And OpenAI would ask them to sign a nondisclosure agreement to prevent them from revealing the existence of the nondisparagement agreement.
    • Even offering staff members a bonus for signing would be bad; OpenAI instead threatened to take away their vested equity.
    • Moreover, OpenAI was deceptive in getting departing staff members to agree — most clearly, it told them they needed to sign within 7 days when they should have gotten 60 (and when challenged on this, OpenAI once audaciously told a departing staff member “The 7 days stated in the General Release supersedes the 60 day signature timeline”).
    • Plus OpenAI responded to this incident deceptively: about what it had done in the past (including clawing back equity vs. merely threatening to do so, and suggesting that its control over staff equity was limited to exit documents rather than the Aestas incorporation documents), about what Altman and the leadership knew, and about what it will do in the future (including ambiguity about the various paths OpenAI had to claw back equity or prevent past staff members from selling it).
    • Ultimately OpenAI said “We have not and never will take away vested equity, even when people didn’t sign the departure documents. We’re removing nondisparagement clauses from our standard departure paperwork, and we’re releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual.”
      • OpenAI has not been explicit about whether it might exclude people from tender events, force them to sell their equity, or otherwise treat them poorly — doing so would look very bad, but the ambiguity discourages criticism.
      • OpenAI does not seem to have released staff members from the silence-about-contract-terms provision of the nondisclosure agreement, so past staff members likely believe they’re disallowed from talking about their nondisclosure agreements.
    • Sources
    • Separately, OpenAI seems to have a general culture where staff are discouraged from publicly criticizing it.
  • Superalignment compute
    • OpenAI committed to give a certain amount of compute to its Superalignment team—and recruited based on this commitment—but denied the team’s requests for compute and failed to give it a schedule for receiving compute. Ultimately the team’s leadership quit because they were “sailing against the wind” and the team was dissolved.
  • There have been two exoduses of safety staff from OpenAI, partially for integrity-related reasons
  • Leopold Aschenbrenner firing
    • OpenAI fired Aschenbrenner, claiming that he leaked information. Aschenbrenner alleges that the document he was told he was fired for leaking was “a brainstorming document . . . . shared . . . with three external researchers for feedback . . . . [without] anything sensitive,” and doing this was normal, and when he asked OpenAI what was confidential its answer didn’t make sense.1 Additionally, Aschenbrenner alleges that when he shared a memo on OpenAI’s security with the OpenAI board, “leadership was very unhappy” and he “got an official HR warning for sharing the memo with the board,” and “when I was fired, it was very made explicit that the security memo was a major reason for my being fired. They said, ‘the reason this is a firing and not a warning is because of the security memo.’”
  • Telling board members not to talk to employees
    • Altman once told board member Tasha McCauley to tell him if she spoke to employees. McCauley was the board’s designated staff liaison and the policy for OpenAI employees to raise concerns involved talking to her, a former OpenAI employee told me.
    • (See also the Aschenbrenner allegation.)
  • Board crisis
    • The board fired Altman for lying and being manipulative, both historically and more blatantly in trying to remove Helen Toner from the board.
      • Altman used Toner’s coauthored paper Decoding Intentions as a pretext to try to remove her, then lied to other board members in an attempt to remove her (New Yorker, NYT).
    • OpenAI executives told the board that Altman lies a lot.
    • Events during the weekend and around Altman’s return [nonexhaustive]
      • Toner said an OpenAI lawyer (incorrectly) told the board that it had a fiduciary duty to the company.
    • Inadequate investigation
      • Investigation was narrow: limited to “the events concerning the November 17, 2023 removal of Sam Altman and Greg Brockman from the OpenAI Board of Directors and Mr. Altman’s termination as CEO.”
      • The report was not published.
      • There was no mention of offering confidentiality or using an anonymous reporting mechanism.
    • OpenAI took credit for March 2024 internal governance “enhancements” but failed to share details.
  • Preparedness Framework
    • OpenAI takes credit for the PF ensuring safety, but it hasn’t demonstrated that the PF is great.
      • In December 2023 OpenAI “adopted” the PF and in May 2024 it said “We’ve evaluated GPT-4o according to our Preparedness Framework.” But as of June 2024 it still hasn’t implemented the PF, hasn’t been clear about that, and hasn’t published a plan to do so.
        • The PF says “As a part of our Preparedness Framework, we will maintain a dynamic (i.e., frequently updated) Scorecard that is designed to track our current pre-mitigation model risk across each of the risk categories, as well as the post-mitigation risk.” But the scorecard has not been published.
        • The PF implies that the evals will be published (“we defer specific details on evaluations to the Scorecard section (and this section is intended to be updated frequently)”); evals have not been published.
        • It’s been over 6 months!
        • OpenAI is taking credit for implementing the PF but not meeting its commitments.
      • Various issues within the framework: most notably, capability thresholds are very high, it basically just applies to external deployment, the part about alignment is very unclear, and OpenAI may share dangerous weights with Microsoft.
  • White House voluntary commitments
    • Publishing dangerous-capability evals: OpenAI committed to “publish reports for all new significant model public releases . . . . These reports should include the safety evaluations conducted (including in areas such as dangerous capabilities, to the extent that these are responsible to publicly disclose).” It released GPT-4o and claimed to do dangerous-capability evals, but did not publish them.
    • Bug bounty: OpenAI committed to establish “bounty systems, contests, or prizes to incent the responsible disclosure of weaknesses, such as unsafe behaviors, or [] include AI systems in their existing bug bounty programs.” But their bug bounty program excludes issues with models.
  • Whistleblowers allege that OpenAI illegally threatened employees to discourage whistleblowing, including that it “threatened employees with criminal prosecutions if they reported violations of law to federal authorities.”
  • Profit cap non-transparency
    • In OpenAI’s first investment round, profits were capped at 100x. The cap for later investments is negotiated with the investor. (OpenAI LP (OpenAI 2019); archive of original.2) In 2021 Altman said the cap was “single digits now” (apparently referring to the cap for new investments, not just the remaining multiplier for first-round investors). But reportedly the cap will increase by 20% per year starting in 2025 (The Information 2023; The Economist 2023). OpenAI took credit for its capped-profit structure but has not discussed or acknowledged the changes.
  • Internal messaging hack
    • OpenAI had a major security incident and didn’t report it, even to the government. It took over a year to come out; it seems that it could easily have never come out. This is evidence that there are things-that-look-bad that OpenAI has successfully kept secret.
  • OpenAI’s original3 “mission is to ensure that artificial general intelligence benefits all of humanity.” Sometimes OpenAI and its leadership and staff state that the mission is to build AGI themselves, including in formal announcements and legal documents.4
  • Some misc safetywashing
    • Claiming rapid scaling improves safety by decreasing compute overhang while also trying to expand chip production
    • Claiming GPT-4 is “aligned” (e.g. Altman)
  • Sam Altman personal stuff
    • Loopt stuff
      • Most concretely: WSJ: “A group of senior employees at Altman’s first startup, Loopt—a location-based social-media network started in the flip-phone era—twice urged board members to fire him as CEO over what they described as deceptive and chaotic behavior, said people familiar with the matter.”
    • YC stuff
      • Double standard on conflicts of interest: “He caused tensions after barring other partners at Y Combinator from running their own [VC] funds,” despite running one himself.
      • When Altman left YC, he published a blogpost on the YC site announcing that he was moving from president to chairman, but he just decided to do this; YC never made him chairman.
    • AltC: in SEC filings, Altman incorrectly claimed to be YC chairman.
    • Double standards: in particular, Altman forced out board members over conflicts of interest while personally doing worse (most clearly, personally investing in companies that then deal with OpenAI).

Microsoft

  • Deployment safety board
    • Microsoft said “When it comes to frontier model deployment, Microsoft and OpenAI have together defined capability thresholds that act as a trigger to review models in advance of their first release or downstream deployment. The scope of a review, through our joint Microsoft-OpenAI Deployment Safety Board (DSB), includes model capability discovery. . . . We have exercised this review process with respect to several frontier models, including GPT-4.” This sounds good, but Microsoft has not elaborated on these “capability thresholds,” shared details about the DSB, or shared details about past reviews.
    • Microsoft takes credit for the DSB without justifying that it’s effective (and it seems ineffective)
    • Microsoft deployed GPT-4 as a test without telling the DSB, and it initially denied this. See https://x.com/kevinroose/status/1798414599152431278.
  • Model evals & red-teaming
    • Microsoft joined the White House voluntary commitments and takes credit for this. But it doesn’t seem to be meeting its commitments on model evals and red-teaming, nor have a plan to meet them, and some of its writing on relevant safety practices is misleading.
      • Microsoft’s commitments include “internal and external red-teaming of models or systems in areas including misuse, societal risks, and national security concerns, such as bio, cyber, [autonomous replication,] and other safety areas” and “publish[ing] reports for all new significant model public releases . . . . These reports should include the safety evaluations conducted (including in areas such as dangerous capabilities, to the extent that these are responsible to publicly disclose).” But it seems to do red-teaming just for undesired content, not dangerous capabilities.
  • Microsoft claimed its Bing Chat deployment as a red-teaming success, but it was in fact a dramatic failure. (This may be an issue of ignorance rather than integrity.)
  • Microsoft released the weights of its WizardLM-2 model in violation of its policy. (This is not directly an integrity issue, just a significant incident of safety policies being violated plus lack of a great response.)
  • Misc safetywashing

Anthropic

  • Suggesting it wouldn’t push the frontier, pushing the frontier, and failing to clarify
  • Nondisparagement
    • Anthropic has offered severance agreements that include a nondisparagement clause and a nondisclosure clause that covers the nondisparagement clause. When this was made public, Anthropic replied that it “recognized that this routine use of non-disparagement agreements, even in these narrow cases, conflicts with [its] mission” and has recently “been going through [its] standard agreements and removing these terms.” Moreover: “Anyone who has signed a non-disparagement agreement with Anthropic is free to state that fact (and we regret that some previous agreements were unclear on this point). If someone signed a non-disparagement agreement in the past and wants to raise concerns about safety at Anthropic, we welcome that feedback and will not enforce the non-disparagement agreement.” This is good, but continuing to use non-disparagement in non-standard cases (as the reply leaves open) is bad, and Anthropic should “remov[e] these terms” retroactively.
      • Anthropic said “some previous agreements were unclear” on whether the nondisparagement clause was covered by nondisclosure, but this is a misleading understatement, as noted by Habryka. For example, Neel Nanda’s severance agreement said “both Parties agree to keep the terms and existence of this agreement and the circumstances leading up to the termination of the Consultant’s engagement and the completion of this agreement confidential”; this is quite clear.
  • Taking credit for LTBT without justifying that it’s good
  • A little safetywashing
    • Slightly exaggerating interpretability research or causing observers to have excessively optimistic impressions of Anthropic’s interpretability research
  • When the Anthropic founders left OpenAI, they seem to have signed a nondisparagement agreement with OpenAI in exchange for OpenAI doing likewise. The details have not been published.
  • Anthropic seems to have said inconsistent things to investors vs safety people.
    • The frontier-pushing thing
    • A 2023 Anthropic pitch deck said “We believe that companies that train the best 2025/26 models will be too far ahead for anyone to catch up in subsequent cycles”; the implicit plan here is inconsistent with the vibe Anthropic gave safety people.
  • No bug bounty
    • They joined the White House voluntary commitments in July 2023, committing to establish “bounty systems, contests, or prizes to incent the responsible disclosure of weaknesses, such as unsafe behaviors, or [] include AI systems in their existing bug bounty programs.”

Google & Google DeepMind

See also

  1. More context:

    I was pulled aside for a chat with a lawyer that quickly turned adversarial. The questions were about my views on AI progress, on AGI, the appropriate level of security for AGI, whether the government should be involved in AGI, whether I and the superalignment team were loyal to the company, and what I was up to during the OpenAI board events. They then talked to a couple of my colleagues and came back and told me I was fired. They’d gone through all of my digital artifacts from my time at OpenAI, and that’s when they found the leak.

    The main claim they made was this leaking allegation. That’s what they told employees. The security memo was another thing. There were a couple of other allegations they threw in. One thing they said was that I was unforthcoming during the investigation because I didn’t initially remember who I had shared the preparedness brainstorming document with, only that I had talked to some external researchers about these ideas.

    The document was over six months old, I’d spent a day on it. It was a Google Doc I shared with my OpenAI email. It wasn’t a screenshot or anything I was trying to hide. It simply didn’t stick because it was such a non-issue. They also claimed I was engaging on policy in a way they didn’t like. They cited there that I had spoken to a couple of external researchers, including someone at a think tank, about my view that AGI would become a government project, as we just discussed.

    In fact, I was speaking with lots of people in the field about that view at the time. I thought it was a really important thing to think about. So they found a DM I had written to a friendly colleague, five or six months earlier, and they cited that too. I had thought it was well within OpenAI norms to discuss high-level issues about the future of AGI with external people in the field.

  2. economic returns for investors and employees are capped (with the cap negotiated in advance on a per-limited partner basis). Any excess returns go to OpenAI Nonprofit. Our goal is to ensure that most of the value (monetary or otherwise) we create if successful benefits everyone, so we think this is an important first step. Returns for our first round of investors are capped at 100x their investment (commensurate with the risks in front of us), and we expect this multiple to be lower for future rounds as we make further progress.

  3. Archive of Charter

  4. Also offhandedly, e.g. Anna Makanju