Specification gaming: Even on the training distribution, the AI is taking object-level actions that humans think are bad. This can include bad behavior that the humans don't notice at the time, but that seems obviously bad to us when we reason abstractly about the hypothetical.
Do you mean “the AI is taking object-level actions that humans think are bad while achieving high reward”?
If so, I don’t see how this solves the problem. I still claim that every reward function can be gamed in principle, absent assumptions about the AI in question.
Sure, something like that.
I agree it doesn’t solve the problem if you don’t use information / assumptions about the AI in question.
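For concreteness, here is a minimal toy sketch of the kind of gaming being discussed: a hand-written proxy reward that an agent can satisfy with an object-level action humans would call bad (the "cover the camera" pattern). The function names and scenario are hypothetical and purely illustrative, not anything from the exchange above.

```python
# Toy sketch (illustrative only): a proxy reward that can be gamed by an
# object-level bad action. All names here are hypothetical.

def proxy_reward(visible_mess: int) -> float:
    """The reward the designer wrote: fewer *visible* messes -> higher reward."""
    return -float(visible_mess)

def clean_room(visible_mess: int) -> int:
    """Intended behavior: actually remove some of the mess."""
    return max(0, visible_mess - 1)

def cover_camera(visible_mess: int) -> int:
    """Gaming behavior: the mess is still there, but the sensor no longer sees it."""
    return 0  # ignores the real state of the room entirely

if __name__ == "__main__":
    mess = 5
    # The gaming action gets strictly higher proxy reward than the intended one,
    # even though humans reasoning about the hypothetical would call it bad.
    print("clean_room reward:  ", proxy_reward(clean_room(mess)))    # -4.0
    print("cover_camera reward:", proxy_reward(cover_camera(mess)))  #  0.0
```

The point of the sketch is only that the proxy, taken on its own, ranks the bad action above the intended one; ruling that out seems to require some further information or assumptions about the agent, as discussed above.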