Agents that explicitly represent their utility function are potentially vulnerable to sign flips.
What sorts of AI designs could not be made to pursue a flipped utility function via a perturbation in one spot? One quick guess: an AI that represents its utility function in several places and uses all of those representations for error correction, only pursuing the error-corrected utility function.
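A minimal sketch of that guess, assuming a linear utility over outcome features and redundant copies of the weight vector (all names and the voting scheme here are illustrative, not from the comment above):

```python
import numpy as np

# Hypothetical setup: the utility function is a weight vector stored in
# three independent copies. Before acting, the agent reconstructs each
# weight by element-wise majority vote (the median), so a sign flip in
# any single copy cannot change the utility it actually pursues.

def error_corrected_weights(copies: np.ndarray) -> np.ndarray:
    """Element-wise median across the redundant copies."""
    return np.median(copies, axis=0)

def utility(weights: np.ndarray, outcome: np.ndarray) -> float:
    """A simple linear utility over outcome features (illustrative)."""
    return float(weights @ outcome)

# Three redundant copies of the same weight vector.
true_weights = np.array([1.0, 2.0, -0.5])
copies = np.tile(true_weights, (3, 1))

# A perturbation "in one spot": flip the sign of one entire copy.
copies[1] *= -1.0

corrected = error_corrected_weights(copies)
outcome = np.array([0.3, 0.1, 0.6])

assert np.allclose(corrected, true_weights)
print(utility(corrected, outcome))  # matches the unperturbed utility
```

With three copies this tolerates corruption of any one of them; flipping two copies at once would of course defeat the vote, so the guarantee is only against a single-spot perturbation.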
Just a phrasing/terminology nitpick: I think this applies to agents with externally imposed utility functions. If an agent has a "natural" or "intrinsic" utility function which it publishes explicitly (and does not accept updates to that explicit form), I think the risk of representation bugs does not arise.