Some types of measurement tampering would be continuous with desired behaviour, and some not. By “continuous” I mean that it lies in the same basin in gradient-descent terms, and by “discontinuous” I mean that it doesn’t.
It’s plausible that measurement tampering can be avoided by ensuring that measurement tampering behavior isn’t explored into.
I’d be interested in trying to find realistic setups and then seeing what happens by default. We could also test various limitations on exploration to see whether these approaches help; we’d be interested in testing exploration-limiting approaches in follow-up work (a rough sketch of one such approach is below).
That said, depending on weaknesses in exploration feels to me like a fundamentally fragile strategy, and I’d prefer not to depend on it.
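To make “limiting exploration” concrete, here’s a minimal sketch of one family of approaches: penalize KL divergence from a reference policy trained only on trusted, human-checkable behavior, so RL is discouraged from putting probability mass on strategies (e.g. tampering) that the reference essentially never takes. Everything in the sketch (the `PolicyNet` architecture, the REINFORCE-style loss, the coefficient) is an illustrative assumption, not anything from this thread.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, kl_divergence


class PolicyNet(nn.Module):
    """Tiny discrete-action policy for illustration."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(obs))


def kl_limited_pg_loss(
    policy: PolicyNet,
    reference: PolicyNet,   # trained only on trusted, human-checkable behavior
    obs: torch.Tensor,      # (batch, obs_dim)
    actions: torch.Tensor,  # (batch,)
    returns: torch.Tensor,  # (batch,) returns from (possibly tamperable) measurements
    kl_coef: float = 1.0,
) -> torch.Tensor:
    """REINFORCE-style loss plus a KL penalty keeping the policy near the reference."""
    dist = policy(obs)
    with torch.no_grad():
        ref_dist = reference(obs)
    pg_loss = -(dist.log_prob(actions) * returns).mean()
    # KL(policy || reference): penalizes putting probability mass on actions the
    # reference policy essentially never takes, i.e. it limits how far RL explores
    # away from trusted behavior.
    kl_penalty = kl_divergence(dist, ref_dist).mean()
    return pg_loss + kl_coef * kl_penalty
```

The KL(policy || reference) direction is the same shape of penalty used in RLHF-style fine-tuning; whether a penalty like this actually blocks tampering strategies without also blocking the creative strategies we want is exactly the empirical question above.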
Note that measurement tampering detection is most applicable in worlds where we need AIs to take inscrutable actions (e.g., inscrutable actions are the most competitive option and we can’t solve our problems with weaker AIs) and thus outcomes-based RL is needed to get these AIs to do the right thing. We probably need these AIs to find creative and powerful strategies. It seems hard for RL to explore these creative strategies while not also exploring measurement tampering strategies. (If we knew our AIs were generally fully aligned, we could just give them instructions and we wouldn’t need RL. But this requires solving the problem via other means.)
I don’t expect discontinuous behaviour to arise if the AI is initially trained in a regime where it is rewarded for the desired behaviour.
The concern is that in order to explore the strategies we want, RL will also explore tampering, and tampering will often be favorable.
We can probably have an “initial” training phase where we only train AIs on examples that humans can confidently label. But this “initial” training only seems relevant if you expect it to generalize quite far. E.g., supposing this training is done before outcomes-based RL, the AI might not yet be very smart, and you need some notion of alignment to be retained despite a huge amount of RL making the AI smarter and more competent in the domain.
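For concreteness, here’s a minimal sketch of that ordering; everything in it (the `Example` fields, `supervised_step`, `outcome_rl_step`, the 0.9 confidence threshold) is a hypothetical stand-in rather than anything proposed in this thread.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    prompt: str
    human_label: str
    confidence: float  # labeler's confidence in [0, 1]


def two_phase_training(
    model,
    examples: List[Example],
    supervised_step: Callable,  # trains `model` on one confidently labeled example
    outcome_rl_step: Callable,  # updates `model` from measured (tamperable) outcomes
    n_rl_steps: int,
    confidence_threshold: float = 0.9,
):
    # Phase 1: train only on examples humans can confidently label.
    for ex in (e for e in examples if e.confidence >= confidence_threshold):
        supervised_step(model, ex)
    # Phase 2: outcomes-based RL, where rewards come from measurements that the
    # policy might learn to tamper with. The hope under discussion is that the
    # alignment instilled in phase 1 generalizes through this phase.
    for _ in range(n_rl_steps):
        outcome_rl_step(model)
    return model
```

Writing it out this way just makes the generalization hope explicit: whatever phase 1 instills has to survive an arbitrary number of outcome-based RL updates in phase 2, which is why you need that alignment to be retained as the model gets much more capable.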
See the ELK report for more discussion of generalization hopes.
Yes, it can explore, but its goals should be shaped by the basin it has been in so far, so it should not jump to another basin, even if it’s good at exploring (where the other basin naturally fits a different goal; if they fit the same goal, they’re effectively the same basin). If it does jump, some assumption has gone wrong, and the appropriate response is a shutdown, not some adjustment.
On the other hand, if it’s very advanced, it might at some point become powerful enough to act on a misgeneralization of its goals, such that some policy highly rewarded under the goal lies outside any natural basin of the reward system, and acting on that policy means subverting the reward system. But the smarter it is, the less likely it is to misgeneralize in this way (though the more capable it is of acting on such a misgeneralization). And in this case the appropriate response is even more clearly a shutdown.
And in the more pedestrian “continuous” case where the goal we’re training on is not quite what we actually want, I’m skeptical you achieve much beyond just adjusting the effective goal slightly.