Everything that has happened since then has made it clear that this is not the case; that all these big flashy commitments like Superalignment were just safety-washing and virtue signaling. They were only going to do alignment work inasmuch as that didn’t interfere with racing full-speed towards greater capabilities.
It’s not clear to me that it was just safety-washing and virtue signaling. I think a better model is something like: there are competing factions within OAI with different views and different interests, which, as a result, prioritize scaling/productization/safety/etc. to varying degrees. Superalignment likely happened because (a) the safety faction (Ilya/Jan/etc.) wanted it, and (b) the Sam faction also wanted it, or tolerated it, or agreed to it due to perceived PR benefits (safety-washing), or let it happen as a result of internal negotiation/compromise, or something else, or some combination of these things.
If OAI as a whole was really only doing anything safety-adjacent for pure PR or virtue signaling reasons, I think its activities would have looked pretty different. For one, it probably would have focused much more on appeasing policymakers than on appeasing the median LessWrong user. (The typical policymaker doesn’t care about the superalignment effort, and likely hasn’t even heard of it.) It would also not be publishing niche (and good!) policy/governance research. Instead, it would probably spend that money on actual PR (e.g., marketing campaigns) and lobbying.
I do think OAI has been tending more in that direction (that is, in the direction of safety-washing, and/or in the direction of just doing less safety stuff, period). But it doesn’t seem to me like it was predestined. I.e., I don’t think it was “only going to do alignment work inasmuch as that didn’t interfere with racing full-speed towards greater capabilities”. Rather, it looks to me like things have tended that way as a result of external incentives (e.g., looming profit, Microsoft) and internal politics (in particular, the safety faction losing power). Things could have gone quite differently, especially if the board battle had turned out differently. Things could still change and the trend could still reverse, even though that seems improbable right now.
Superalignment likely happened because (a) the safety faction (Ilya/Jan/etc.) wanted it, and (b) the Sam faction also wanted it, or tolerated it, or agreed to it due to perceived PR benefits (safety-washing), or let it happen as a result of internal negotiation/compromise, or something else, or some combination of these things.
Sure, that’s basically my model as well. But if faction (b) only cares about alignment due to perceived PR benefits or in order to appease faction (a), and faction (b) turns out to have overriding power such that it can destroy or drive out faction (a) and then curtail all the alignment efforts, I think it’s fair to compress all that into “OpenAI’s alignment efforts are safety-washing”. If (b) has the real power within OpenAI, then OpenAI’s behavior and values can be approximately rounded off to (b)’s behavior and values, and (a) is a rounding error.
If OAI as a whole was really only doing anything safety-adjacent for pure PR or virtue signaling reasons, I think its activities would have looked pretty different
Not if (b) is concerned about fortifying OpenAI against future challenges, such as hypothetical futures in which the AGI Doomsayers get their way and the government/the general public wakes up and tries to nationalize or ban AGI research. In that case, having a prepared, well-documented narrative of going above and beyond to ensure that their products are safe, well before any other parties woke up to the threat, would leave OpenAI much better positioned to retain control over its research.
(I interpret Sam Altman’s behavior at Congress as evidence for this kind of longer-term thinking. He didn’t try to downplay the dangers of AI, which would have been easy, and which is what someone myopically optimizing for short-term PR would do. He proactively brought up the concerns that future AI progress might awaken, getting ahead of them, and thereby established OpenAI as taking them seriously and put himself in a position to control/manage those concerns.)
And it’s approximately what I would do, at least, if I were in charge of OpenAI and had a different model of AGI Ruin.
And this is the potential plot whose partial failure I’m currently celebrating.