Yes, it can explore—but its goals should be shaped by the basin it has been in so far, so it should not jump to another basin (where the other basin naturally fits a different goal—if they fit the same goal, they’re effectively the same basin), even if it’s good at exploring. If it does, some assumption has gone wrong, and the appropriate response is a shutdown, not some adjustment.
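To make that decision rule concrete, here is a minimal toy sketch (my own illustration, not something proposed above): it summarizes the “basin” as the distribution of behavior features seen during training, and treats a large jump away from that distribution as grounds for shutdown rather than for adjusting the goal. The feature representation, the names, and the threshold are all hypothetical.

```python
import numpy as np

# Toy illustration only: summarize the "basin" as the distribution of
# behavior features seen during training, and treat a large jump away
# from it as a reason to shut down rather than adjust the goal.
# The feature representation and the threshold are assumptions.

BASIN_JUMP_THRESHOLD = 3.0  # hypothetical, in units of historical std devs


def basin_summary(history: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Mean and spread of behavior features observed so far (the 'basin')."""
    return history.mean(axis=0), history.std(axis=0) + 1e-8


def respond_to_behavior(history: np.ndarray, new_behavior: np.ndarray) -> str:
    """Return 'shutdown' if the new behavior lies outside the historical
    basin, otherwise 'continue' (where small goal adjustments are fine)."""
    center, spread = basin_summary(history)
    z = np.abs(new_behavior - center) / spread  # distance from the basin center
    if z.max() > BASIN_JUMP_THRESHOLD:
        return "shutdown"  # an assumption has gone wrong; don't just re-tune
    return "continue"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    history = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))  # in-basin behavior
    print(respond_to_behavior(history, np.array([0.5, -1.0, 0.2, 1.5])))  # continue
    print(respond_to_behavior(history, np.array([10.0, 0.0, 0.0, 0.0])))  # shutdown
```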
On the other hand, if it’s very advanced, it might at some point become powerful enough to act on a misgeneralization of its goals, one in which some policy highly rewarded by the goal lies outside any natural basin of the reward system, so that acting on it means subverting the reward system. But the smarter it is, the less likely it is to misgeneralize in this way (though the more capable it is of acting on it). And in this case the appropriate response is even more clearly a shutdown.
And in the more pedestrian “continuous” case, where the goal we’re training on is not quite what we actually want, I’m skeptical you achieve much beyond slightly adjusting the effective goal.