And given the stakes, I think it’s foolish to treat alignment as a continuum. From the human perspective, if there is an AGI, it will either be one we’re okay with or one we’re not okay with. Aligned or misaligned. No one will care that it has a friendly blue avatar that writes sonnets, if the rest of it is building a superflu. You haven’t “jailbroken” it if you get it to admit that it’s going to kill you with a superflu. You’ve revealed its utility function and capabilities.