Maybe the project will come up with some mechanism that detects this. But if they fall back on the naive approach of “just watch what it does in the test environment and assume it’ll do the same in production,” there’s a risk that the system figures out it’s in a test environment, realises its judges would not react well to discovering what’s wrong with its utility function, and therefore acts aligned in testing.
If we ever see a news headline saying “Good News, AGI seems to ‘self-align’ regardless of the sign of the utility function!” that will be some very bad news.
I asked Rohin Shah about that possibility in a question thread about a month ago. I think he’s probably right that this type of thing would only plausibly make it through the training process if the system were *already* smart enough to reason about it. And on top of that, there are still things like sanity checks, which, while unlikely to catch many kinds of error, would probably notice one as blatant as a flipped sign. See also this comment:
> Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in an obvious way that makes it easy for developers to recognize that deployment is a very bad idea.
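To make the “sanity checks” point concrete: one cheap check is to assert, before any training or deployment, that the utility function ranks a few hand-labelled good outcomes above a few hand-labelled bad ones; a flipped sign inverts that ranking wholesale, so even a crude test catches it. The sketch below is purely illustrative — `check_utility_sign` and the toy `human_approval` outcome format are hypothetical names of my own, not any project’s real API.

```python
# Hypothetical sketch of the kind of pre-deployment sanity check discussed
# above. The function name and the toy outcome format are illustrative
# assumptions, not any project's real API.

def check_utility_sign(utility_fn, good_outcomes, bad_outcomes):
    """Fail loudly if every hand-labelled good outcome scores below every
    hand-labelled bad one -- the signature of a flipped sign."""
    good_scores = [utility_fn(o) for o in good_outcomes]
    bad_scores = [utility_fn(o) for o in bad_outcomes]
    if max(good_scores) < min(bad_scores):
        raise RuntimeError(
            "Utility ranks all bad outcomes above all good ones; "
            "likely sign error -- do not train or deploy."
        )

# A deliberately sign-flipped toy utility to show the check firing:
utility = lambda outcome: -outcome["human_approval"]  # note the stray minus
good = [{"human_approval": 1.0}, {"human_approval": 0.5}]
bad = [{"human_approval": -1.0}, {"human_approval": -0.5}]

try:
    check_utility_sign(utility, good, bad)
except RuntimeError as e:
    print(e)  # the crude check catches the blatant error
```

A check like this obviously won’t catch subtle misspecification, which is exactly the asymmetry the quoted comment points at: gross flaws tend to be loud, and it’s the quiet ones that are worrying.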
IMO it’s incredibly important that we find a way to prevent this type of thing from occurring *after* the system has been trained, whether that be hyperexistential separation or something else. I think that a team that’s safety-conscious enough to come up with a (reasonably) aligned AGI design is going to put considerable effort into fixing bugs, and a bug as obvious as a sign error would be unlikely to make it through. Better yet, hopefully they would have come up with a utility function that can’t be reversed by a single bit flip, or that doesn’t cause outcomes worse than death when minimised. That would (hopefully?) solve the SignFlip issue *regardless* of what causes it.
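As a rough illustration of what “can’t be easily reversed by a single bit flip” might mean in practice, here is a minimal sketch combining two ideas. First, guard the stored utility parameters with a checksum, so a flipped bit is detected rather than silently negating the objective. Second, floor the utility at a “null outcome” value, so that even a negated objective is minimised at “nothing happens” rather than rewarded for going below it. All names and the encoding scheme are assumptions of my own for illustration, not any real proposal’s implementation.

```python
# A minimal sketch, under illustrative assumptions (names, encoding, and
# checksum scheme are all my own), of two complementary ideas:
# (1) guard stored utility parameters with a checksum, so a single flipped
#     bit is detected instead of silently negating the objective;
# (2) floor the utility at a "null outcome" value, so even a negated
#     objective is minimised at "nothing happens".

import hashlib
import struct

def pack_utility_params(sign: float, scale: float) -> bytes:
    """Serialise parameters with an 8-byte SHA-256 checksum appended."""
    payload = struct.pack("<dd", sign, scale)
    return payload + hashlib.sha256(payload).digest()[:8]

def unpack_utility_params(blob: bytes) -> tuple:
    """Refuse to return parameters whose checksum does not match."""
    payload, digest = blob[:-8], blob[-8:]
    if hashlib.sha256(payload).digest()[:8] != digest:
        raise ValueError("utility parameters corrupted; refusing to run")
    return struct.unpack("<dd", payload)

def utility(outcome_value: float, blob: bytes) -> float:
    sign, scale = unpack_utility_params(blob)
    raw = sign * scale * outcome_value
    # Floor at the null outcome: minimising this never specifically rewards
    # pushing below "nothing happens", even if the sign somehow ends up
    # negated.
    return max(0.0, raw)

params = pack_utility_params(sign=1.0, scale=1.0)

# Flip the IEEE-754 sign bit of the stored sign (byte 7, little-endian):
corrupted = params[:7] + bytes([params[7] ^ 0x80]) + params[8:]
try:
    utility(1.0, corrupted)
except ValueError as e:
    print(e)  # the checksum catches the flip instead of negating the goal
```

Note the floor only removes the incentive to optimise for outcomes below the null state; it doesn’t restore the original objective. That’s why the checksum (or something stronger, like redundant copies with majority voting) is doing the real work of detecting the flip.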