I would like to point out a potential problem with my own idea: it's not necessarily clear that cooperating with us is in the AI's best interest, as opposed to manipulating us in some hard-to-detect manner. For instance, if it "thinks" it can get away with telling us it's aligned and offering a reasonable-sounding (but actually false) proof of its own alignment, that would be better for it than being truly aligned and thereby compromising its original utility function. On the other hand, if there's even a small chance we'd be able to detect that sort of deception and shut it down, then, as long as we also require proof that it won't "unalign itself" later, it should be rationally forced into cooperating, imo.
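To make the "even a small chance" point concrete, here's a toy expected-utility sketch. All the payoffs and probabilities are made-up placeholders, not claims about any real system; the only point is that the detection probability needed to make cooperation the better bet depends on how close the cooperative payoff is to the payoff of successful deception.

```python
# Toy expected-utility comparison for the deceive-vs-cooperate choice.
# All payoffs here are illustrative placeholders chosen for the example.

def eu_deceive(p_detect: float,
               u_deception_succeeds: float = 100.0,
               u_shutdown: float = 0.0) -> float:
    """EU of faking alignment: big payoff if undetected, shutdown otherwise."""
    return (1 - p_detect) * u_deception_succeeds + p_detect * u_shutdown

def eu_cooperate(u_cooperate: float = 99.0) -> float:
    """EU of genuine cooperation: a certain but slightly compromised payoff."""
    return u_cooperate

# With these numbers, any detection chance above 1% flips the decision:
for p in (0.0, 0.005, 0.02, 0.05):
    d, c = eu_deceive(p), eu_cooperate()
    print(f"p_detect={p:.3f}: deceive={d:.1f}, cooperate={c:.1f}, "
          f"best={'deceive' if d > c else 'cooperate'}")
```

The upshot: the smaller the sacrifice that genuine alignment imposes on the AI's original utility function, the smaller the detection probability needed to force cooperation, which is one reason requiring proof against later "unaligning" matters in this argument.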