Agree with simon that if the AI gets rich data about what counts as “measurement tampering,” you’re sort of pushing around the loss basin; but if tampering was optimal in the first place, the remaining optimum is still probably some unintended solution that has most of the effects of tampering without falling under the human-provided definition. Not only is there usually no bright-line distinction between undesired and desired behavior, the AI would be incentivized to avoid developing such a distinction.
I agree that this isn’t actually that big a problem in modest settings that humans can effectively oversee, because tampering isn’t even that advantageous when there’s a lot of expensive human oversight with a good chance of catching anything detrimental. There still might be an incentive to find extreme plans like “take over the world to tamper with my reward signal,” but those indeed seem distinct enough to penalize, assuming that you don’t ever want your AI to take over the world for any reason.
But I expect we’ll sometimes want to build AI to do things that humans can’t effectively oversee.
Indeed, the core concern is that tampering actions would get higher reward and human supervision would be unable to understand tampering (e.g. because the actions of the AI are generally inscrutable).