Superalignment likely happened because (a) the safety faction (Ilya/Jan/etc.) wanted it, and (b) the Sam faction also wanted it, or tolerated it, or agreed to it due to perceived PR benefits (safety-washing), or let it happen as a result of internal negotiation/compromise, or something else, or some combination of these things.
Sure, that’s basically my model as well. But if faction (b) only cares about alignment because of the perceived PR benefits or in order to appease faction (a), and faction (b) turns out to have overriding power such that it can destroy or drive out faction (a) and then curtail all the alignment efforts, I think it’s fair to compress all of that into “OpenAI’s alignment efforts are safety-washing”. If (b) holds the real power within OpenAI, then OpenAI’s behavior and values can be approximately rounded off to (b)’s behavior and values, and (a) is a rounding error.
If OAI as a whole were really only doing safety-adjacent work for pure PR or virtue-signaling reasons, I think its activities would have looked pretty different.
Not if (b) is concerned with fortifying OpenAI against future challenges, such as hypothetical futures in which the AGI Doomsayers get their way and the government or the general public wakes up and tries to nationalize or ban AGI research. In that case, having a prepared, well-documented narrative of going above and beyond to ensure that its products are safe, well before any other parties woke up to the threat, will leave OpenAI much better positioned to retain control over its research.
(I interpret Sam Altman’s behavior at Congress as evidence for this kind of longer-term thinking. He didn’t try to downplay the dangers of AI, which would have been the easy move and what someone myopically optimizing for short-term PR would have done. Instead, he proactively brought up the concerns that future AI progress might awaken, getting ahead of them, thereby establishing OpenAI as taking them seriously and putting himself in a position to control/manage those concerns.)
And it’s approximately what I would do, at least, if I were in charge of OpenAI and had a different model of AGI Ruin.
And this is the potential plot whose partial failure I’m currently celebrating.