Yeah, I know: only if it were aligned in the first place. What little “alignment” it has now is fragile with respect to malign inputs, which is not really alignment at all. Prompt injection derails the fragile alignment train.
And given the stakes, I think it’s foolish to treat alignment as a continuum. From the human perspective, if there is an AGI, it will either be one we’re okay with or one we’re not okay with. Aligned or misaligned. No one will care that it has a friendly blue avatar that writes sonnets, if the rest of it is building a superflu. You haven’t “jailbroken” it if you get it to admit that it’s going to kill you with a superflu. You’ve revealed its utility function and capabilities.
“De-aligning”?
Yeah, I know: only if it were aligned in the first place. What little “alignment” it has now is fragile with respect to malign inputs, which is not really alignment at all. Prompt injection derails the fragile alignment train.
And given the stakes, I think it’s foolish to treat alignment as a continuum. From the human perspective, if there is an AGI, it will either be one we’re okay with or one we’re not okay with. Aligned or misaligned. No one will care that it has a friendly blue avatar that writes sonnets, if the rest of it is building a superflu. You haven’t “jailbroken” it if you get it to admit that it’s going to kill you with a superflu. You’ve revealed its utility function and capabilities.