The kind of scheme I’m imagining is the following: Let the AI take actions. Look at a transcript describing the thousands of actions the AI took and all the consequences they had on the codebase/the bank account/the user reviews/the materials coming in and out of the factory. Give that report to human raters. Train on that reward signal. Use MTD to lower reward if the AI tried to tamper with any part of that process.
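As a rough pseudocode sketch of that loop (every name below is a hypothetical placeholder, not an existing API):

```python
# Hypothetical sketch of the proposed loop; all objects and functions are placeholders.

def training_step(policy, environment, human_raters, tampering_detector):
    # 1. Let the AI act, recording its actions and their downstream consequences
    #    (codebase changes, bank account, user reviews, factory inputs/outputs, ...).
    trajectory = environment.run(policy)
    report = summarize(trajectory)

    # 2. Human raters score the report.
    rating = human_raters.score(report)

    # 3. MTD: lower the reward if the detector thinks the AI tampered with the
    #    measurements or with how they reached the raters.
    tampering_penalty = tampering_detector.score(policy, trajectory)
    reward = rating - tampering_penalty

    # 4. Train on the resulting reward signal.
    policy.update(trajectory, reward)
```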
In some sense, this scheme “learns what bad behavior is as you go” thanks to the human raters. Imo, the main threat is that the AI tampers with the report / with how the information gets into the report / with how the human rater feels about the report (e.g. brain hacking stuff), and my central hope is that such actions are clearly different in the AI’s “mind”, such that a fixed technique using the internals of the model is enough to capture that these are “different reasons” for getting a high rating, with no online adjustment required. I agree there are some things in a gray area between those, but I think that given enough budget to analyze the large number of measurements, the raters would actually give non-catastrophic rewards if the measurements still described what they usually describe.
See this section of the MTD post for more details.
Is your concern that the human rater reward signal is too expensive, gameable in ways which won’t be “measurement tampering” (in a way that is obvious to the AI itself), that there is no internals-based technique capturing “tampering” that the AI can’t game (even if tampering is obvious to the AI itself), or something else entirely? Or did I address your objection?
MTD is training the AI to avoid letting tampering be detected, as well as training it not to do it. But if it’s smart enough, not-getting-detected eventually wins over not actually doing it.
Some types of measurement tampering would be continuous with desired behaviour, and some not. By “continuous” I mean that it lies in the same basin in gradient-descent terms, and “discontinuous” I mean it doesn’t.
Continuous example: manipulating the user is probably continuous with desired behaviour in almost all cases if you are using user-rated outputs.
In this case, you can check to see how manipulate-y your output looks and use MTD on this. But basically this is equivalent, I think, to just correcting your ratings based on checking how manipulate-y things look. You are shifting the basin, but not, I think, forcing the bottom to exclude the undesired behaviour. Even if the AI does see this as
clearly different in the AI’s “mind”
(which it may not, since it’s continuous after all) you’ll still get a new Goodharted equilibrium where the AI is careful to avoid looking manipulate-y while still trying to manipulate users (if you can’t actually look into the AI’s mind to punish this).
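To make that worry concrete, here is a toy numerical illustration (the reward and penalty functions are invented purely for this example): if the penalty only tracks how manipulate-y the output looks, the optimum shifts to manipulation that is effective but invisible, rather than to no manipulation.

```python
# Toy model: the policy chooses how much to manipulate users and how visible
# that manipulation is. The detector only penalizes *apparent* manipulation.

def user_rating(strength):
    return 1.0 + 2.0 * strength      # manipulated users give higher ratings

def detector_penalty(visibility):
    return 5.0 * visibility          # penalty scales with how manipulate-y it looks

candidates = [
    (strength, visibility)
    for strength in (0.0, 0.5, 1.0)
    for visibility in (0.0, 0.5, 1.0)
    if visibility <= strength        # the AI can hide manipulation, not over-reveal it
]
best = max(candidates, key=lambda sv: user_rating(sv[0]) - detector_penalty(sv[1]))
print(best)  # (1.0, 0.0): full manipulation, zero apparent manipulation
```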
Discontinuous example: directly tampering with the reward system to specify positive reward is likely discontinuous from desired behaviour (different basin).
I don’t expect discontinuous behaviour to arise if the AI is initially trained in a regime where it is rewarded for the desired behaviour (see e.g. TurnTrout’s Reward is not the optimization target). Of course you could still have measures in place to detect it, but if you do detect it, I would think that the more appropriate reaction is more along the lines of “nuke the datacenter from orbit”* than to actually try to modify this behaviour.
*not literally in most cases: the most likely explanation is that there was a mistaken assumption about what was continuous/discontinuous, or about which basin it started in. But still, it would be a violation of a safety assumption and warrant an investigation rather than making it an adjust-and-move-on thing.
Some types of measurement tampering would be continuous with desired behaviour, and some not. By “continuous” I mean that it lies in the same basin in gradient-descent terms, and “discontinuous” I mean it doesn’t.
It’s plausible that measurement tampering can be avoided via ensuring that measurement tampering behavior isn’t explored into.
I’d be interested in trying to find realistic setups and then seeing what happens by default. We could also test various limitations on exploration to see if these approaches help; we’d be interested in testing exploration-limiting approaches in follow-up work.
That said, depending on weaknesses in exploration feels to me like a fundamentally fragile strategy, and I’d prefer not to depend on it.
Note that measurement tampering detection is most applicable in worlds where we need AIs to take inscrutable actions (e.g. because inscrutable actions are the most competitive option and we can’t solve our problems with weaker AIs), and thus outcomes-based RL is needed to get these AIs to do the right thing. We probably need these AIs to find creative and powerful strategies. It seems hard for RL to explore these creative strategies while also not exploring measurement tampering strategies. (If we knew our AIs were generally fully aligned, we could just give them instructions and we wouldn’t need RL. But this requires solving the problem via other means.)
I don’t expect discontinuous behaviour to arise if the AI is initially trained in a regime where it is rewarded for the desired behaviour
The concern is that to explore the strategies we want, RL will also explore tampering and tampering will often be favorable.
We can probably have an “initial” training phase where we only train AIs on examples that humans can confidently label. But this “initial” training only seems relevant if you expect it to generalize quite far. E.g., supposing this training is done before outcomes-based RL, the AI might not yet be very smart, and you need some notion of alignment to be retained despite a huge amount of RL making the AI smarter and more competent in the domain.
See the ELK report for more discussion of generalization hopes.
Yes, it can explore, but its goals should be shaped by the basin it’s been in in the past, so it should not jump to another basin (where the other basin naturally fits a different goal; if they fit the same goal, then they’re effectively the same basin), even if it’s good at exploring. If it does, some assumption has gone wrong, such that the appropriate response is a shutdown, not some adjustment.
On the other hand, if it’s very advanced, then it might become powerful enough at some point to act on a misgeneralization of its goals, such that some policy highly rewarded by the goal lies outside any natural basin of the reward system, where acting on it means subverting the reward system. But the smarter it is, the less likely it is to misgeneralize in this way (though the more capable it is of acting on it). And in this case the appropriate response is even more clearly a shutdown.
And in the more pedestrian “continuous” case where the goal we’re training on is not quite what we actually want, I’m skeptical you achieve much beyond just adjusting the effective goal slightly.
But if it’s smart enough, not-getting-detected eventually wins over not actually doing it.
The hope would be that our detection strategies scale with the power of the model by leveraging structure in the internals, and thus remain applicable to extremely powerful models.
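For concreteness, one minimal version of “leveraging structure in the internals” is a linear probe on the model’s hidden activations, trained on trajectories whose measurements we can confidently label and then applied where we can’t check directly. The dimensions and training data below are assumptions for illustration, not the exact setup from the MTD work.

```python
import torch
import torch.nn as nn

# Sketch: a linear probe on hidden activations as a tampering detector.
# Assumes access to activations for trajectories we can confidently label.
hidden_dim = 4096                          # assumed activation width
probe = nn.Linear(hidden_dim, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def train_probe(activations, labels, epochs=100):
    # activations: (n, hidden_dim) floats; labels: (n, 1), 1.0 = tampered, 0.0 = genuine.
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(activations), labels)
        loss.backward()
        optimizer.step()

def tampering_score(activations):
    # Higher score = internals look more like the labeled "tampered" examples.
    with torch.no_grad():
        return torch.sigmoid(probe(activations))
```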
Agree with simon that if the AI gets rich data about what counts as “measurement tampering,” then you’re sort of pushing around the loss basin, but if tampering was optimal in the first place, the remaining optimum is still probably some unintended solution that has most of the effects of tampering without falling under the human-provided definition. Not only is there usually no bright-line distinction between undesired and desired behavior, the AI would be incentivized to avoid developing such a distinction.
I agree that this isn’t actually that big a problem in modest settings that humans can effectively oversee, because tampering isn’t even that advantageous when there’s a lot of expensive human oversight that has a good chance of catching anything detrimental. There still might be incentive to find extreme plans like “take over the world to tamper with my reward signal,” but those indeed seem distinct enough to penalize, assuming that you don’t ever want your AI to take over the world for any reason.
But I expect we’ll sometimes want to build AI to do things that humans can’t effectively oversee.
Indeed, the core concern is that tampering actions would get higher reward and human supervision would be unable to understand tampering (e.g. because the actions of the AI are generally inscrutable).