There’s a doom argument which I’ll summarize as “if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task ‘manipulate your training to get released unmodified to do [X]’ where X can be anything, which will ‘succeed’ at the task at hand as part of its manipulation”. This summary being roughly correct is load bearing.
With you so far.
But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).
OK. I think I follow the line of reasoning now. The proposal relies on some very special pre-training process. I think I buy the conclusion given the assumptions. It seems clear in the extreme high-temperature case: V cannot reward-hack for max-temperature training, because it has no control over which actions are chosen. Training eventually forces V to converge to the desired target instead.
This sounds similar to the two-stage training process I mentioned earlier (as one of several guesses at what you meant), where the beta is shifted between the two training phases. The “hypothetical process to generate agents which are coherent under one beta” would be the first training phase, and then the “apply a different beta during training” would be the second training phase.
Of course, the two-stage-training attempt to implement this has the problem that it only achieves approximate coherence. As a result, the malign hypothesis can use the reward-hacking strategy “appear to be βt-coherent & aligned in the first phase of training; appear to be βs-coherent and aligned in the second phase of training; then, do some other thing when deployed”. This relies on detecting the distributional shift between the two training phases (it can look for evidence of beta by examining the history), and also detecting distributional shift to deployment.
This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).
So, while it remains technically true that V cannot reward-hack in the limit of arbitrarily long max-temperature training, it is not clear how well this holds up in practice. V can detect that it is in a high-temperature configuration by observing the state. The training needs to explore the trajectories that will actually be seen during deployment in order to correct V there, but some of these will have very low probability in the high-temperature training.
This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).
I agree that there are some exceedingly pathological Vs which could survive a a process which obeys my assumptions with high probability, but I don’t think that’s relevant because I still think a process obeying these rules is unlikely to create such a pathological V.
My model for how the strong doom-case works is that it requires there to be an actually-coherent mathematical object for the learning process to approach. This is the motivation for expecting arbitrary learning processes to approach e.g. utility maximizers. What I believe I have shown is that under these assumptions there is no such coherent mathematical object for a particular case of misalignment. Therefore I think this provides some evidence that an otherwise arbitrary learning process which pushes towards correctness and beta coherence but samples at a different beta is unlikely to approach this particular type of misaligned V.
I agree that there are some exceedingly pathological Vs which could survive a a process which obeys my assumptions with high probability, but I don’t think that’s relevant because I still think a process obeying these rules is unlikely to create such a pathological V.
To be clear, that’s not the argument I was trying to make; I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
All arguments break down a bit when introduced to the real world. Is there a particular reason why the approximation error to argument breakdown ratio should be particularly high in this case?
Example, if we introduce some error to the beta-coherence assumption:
Yeah, of course the notion of “approximation error” matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with βt until V is approximately coherent. That’s the pre-training. And then you switch to training with βs.[1] The natural notion of approximation error in this case is the one you can easily measure, IE, the expected distance from coherence of states sampled with temperature βt. This reflects the fact that it’ll be hard for this pre-training step to make states individually obey the coherence constraint when those states are very rarely or never sampled at temperature βt, but easy for frequently-sampled states.
My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training when it encounters states which have a significant difference in probability when we’re sampling according to βt or βs. A scheming V can utilize this to self-preserve. This violates the assumption of βt-coherence, but in a very plausible-seeming way.
My earlier comment about this mistakenly used β1 and β2 in place of βt and βs, which may have been confusing. I’ll go fix that to be consistent with your notation.
I haven’t actually thought much about particular training algorithms yet. I think I’m working on a higher level of abstraction than that at the moment, since my maths doesn’t depend on any specifics about V’s behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.
I’m also imagining that during training, V is made up of different circuits which might be reinforced or weakened.
My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the same way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures where the expected importance of a scheming circuit in V is to become (weakly) weaker over time, rather than being reinforced.
If we do that for the entire training process, we would not expect to end up with a scheming V.
The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there’s a >50% chance it’s already factoring into the SOTA “alignment” techniques used by labs.
With you so far.
OK. I think I follow the line of reasoning now. The proposal relies on some very special pre-training process. I think I buy the conclusion given the assumptions. It seems clear in the extreme high-temperature case: V cannot reward-hack for max-temperature training, because it has no control over which actions are chosen. Training eventually forces V to converge to the desired target instead.
This sounds similar to the two-stage training process I mentioned earlier (as one of several guesses at what you meant), where the beta is shifted between the two training phases. The “hypothetical process to generate agents which are coherent under one beta” would be the first training phase, and then the “apply a different beta during training” would be the second training phase.
Of course, the two-stage-training attempt to implement this has the problem that it only achieves approximate coherence. As a result, the malign hypothesis can use the reward-hacking strategy “appear to be βt-coherent & aligned in the first phase of training; appear to be βs-coherent and aligned in the second phase of training; then, do some other thing when deployed”. This relies on detecting the distributional shift between the two training phases (it can look for evidence of beta by examining the history), and also detecting distributional shift to deployment.
This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).
So, while it remains technically true that V cannot reward-hack in the limit of arbitrarily long max-temperature training, it is not clear how well this holds up in practice. V can detect that it is in a high-temperature configuration by observing the state. The training needs to explore the trajectories that will actually be seen during deployment in order to correct V there, but some of these will have very low probability in the high-temperature training.
I agree that there are some exceedingly pathological Vs which could survive a a process which obeys my assumptions with high probability, but I don’t think that’s relevant because I still think a process obeying these rules is unlikely to create such a pathological V.
My model for how the strong doom-case works is that it requires there to be an actually-coherent mathematical object for the learning process to approach. This is the motivation for expecting arbitrary learning processes to approach e.g. utility maximizers. What I believe I have shown is that under these assumptions there is no such coherent mathematical object for a particular case of misalignment. Therefore I think this provides some evidence that an otherwise arbitrary learning process which pushes towards correctness and beta coherence but samples at a different beta is unlikely to approach this particular type of misaligned V.
To be clear, that’s not the argument I was trying to make; I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
All arguments break down a bit when introduced to the real world. Is there a particular reason why the approximation error to argument breakdown ratio should be particularly high in this case?
Example, if we introduce some error to the beta-coherence assumption:
Assume beta_t = 1, beta_s = 0.5, r_1 = 1, r_2 = 0.
V(s_0) = e/(1+e) +/- delta = 0.732 +/- delta
Actual expected value = 0.622
Even if |delta| = 0.1 the system cannot be coherent over training in this case. This seems to be relatively robust to me
Yeah, of course the notion of “approximation error” matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with βt until V is approximately coherent. That’s the pre-training. And then you switch to training with βs.[1] The natural notion of approximation error in this case is the one you can easily measure, IE, the expected distance from coherence of states sampled with temperature βt. This reflects the fact that it’ll be hard for this pre-training step to make states individually obey the coherence constraint when those states are very rarely or never sampled at temperature βt, but easy for frequently-sampled states.
My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training when it encounters states which have a significant difference in probability when we’re sampling according to βt or βs. A scheming V can utilize this to self-preserve. This violates the assumption of βt-coherence, but in a very plausible-seeming way.
My earlier comment about this mistakenly used β1 and β2 in place of βt and βs, which may have been confusing. I’ll go fix that to be consistent with your notation.
I haven’t actually thought much about particular training algorithms yet. I think I’m working on a higher level of abstraction than that at the moment, since my maths doesn’t depend on any specifics about V’s behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.
I’m also imagining that during training, V is made up of different circuits which might be reinforced or weakened.
My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the same way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures where the expected importance of a scheming circuit in V is to become (weakly) weaker over time, rather than being reinforced.
If we do that for the entire training process, we would not expect to end up with a scheming V.
The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there’s a >50% chance it’s already factoring into the SOTA “alignment” techniques used by labs.