I agree that there are some exceedingly pathological Vs which could survive a process that obeys my assumptions with high probability, but I don’t think that’s relevant, because I still think a process obeying these rules is unlikely to create such a pathological V.
To be clear, that’s not the argument I was trying to make; I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
All arguments break down a bit when introduced to the real world. Is there a particular reason why the ratio of argument breakdown to approximation error should be especially high in this case?
For example, if we introduce some error into the beta-coherence assumption:
Assume βt = 1, βs = 0.5, r1 = 1, r2 = 0.
V(s0) = e/(1+e) ± δ ≈ 0.731 ± δ
Actual expected value under βs ≈ 0.622
Even with |δ| = 0.1, the value that βs-training pushes V(s0) towards (≈ 0.622) lies outside that band, so the system cannot remain coherent over training in this case. This seems relatively robust to me.
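A quick numerical check of those figures, as a minimal Python sketch (assuming a single decision between two terminal successors whose values equal their rewards, with action probability proportional to exp(β·r)):

```python
import math

def boltzmann_expected_reward(rewards, beta):
    """Expected reward when the action is sampled with probability proportional to exp(beta * r)."""
    weights = [math.exp(beta * r) for r in rewards]
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)

rewards = [1.0, 0.0]                                   # r1, r2
v_coherent = boltzmann_expected_reward(rewards, 1.0)   # beta_t-coherent V(s0) = e/(1+e), ~0.731
v_actual   = boltzmann_expected_reward(rewards, 0.5)   # value actually realised at beta_s, ~0.622

print(round(v_coherent, 3), round(v_actual, 3))        # 0.731 0.622
print(abs(v_coherent - v_actual) > 0.1)                # True: a +/-0.1 band around 0.731 misses 0.622
```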
Yeah, of course the notion of “approximation error” matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with βt until V is approximately coherent. That’s the pre-training. And then you switch to training with βs.[1]

The natural notion of approximation error in this case is the one you can easily measure, i.e., the expected distance from coherence of states sampled with temperature βt. This reflects the fact that it’ll be hard for this pre-training step to make states individually obey the coherence constraint when those states are very rarely or never sampled at temperature βt, but easy for frequently-sampled states.
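A minimal sketch of that measurement, assuming hypothetical `V`, `successors`, and `sample_state` stand-ins (none of these are defined in the post): estimate the expected distance from βt-coherence over states drawn from the distribution actually visited at temperature βt.

```python
import math

def coherence_gap(V, successors, state, beta_t):
    """|V(s) - Boltzmann(beta_t)-weighted average of V over the successors of s|."""
    succ = successors(state)                      # hypothetical: successor states of `state`
    weights = [math.exp(beta_t * V(s)) for s in succ]
    boltzmann_value = sum(w * V(s) for w, s in zip(weights, succ)) / sum(weights)
    return abs(V(state) - boltzmann_value)

def expected_coherence_error(V, successors, sample_state, beta_t, n=1000):
    """Monte-Carlo estimate of the coherence error, with states drawn from the
    distribution visited when sampling at temperature beta_t."""
    states = [sample_state() for _ in range(n)]   # hypothetical sampler at temperature beta_t
    return sum(coherence_gap(V, successors, s, beta_t) for s in states) / n
```

States that are rarely or never visited at temperature βt contribute almost nothing to this estimate, which is exactly why they can stay far from coherence.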
My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training when it encounters states whose probability differs significantly depending on whether we’re sampling according to βt or βs. A scheming V can exploit this to preserve itself. This violates the assumption of βt-coherence, but in a very plausible-seeming way.
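One minimal sketch of such a distinguishing signal (with hypothetical inputs: the successor-value vector seen at each decision and the index of the action actually taken) is the log-likelihood ratio of the observed choices under the two temperatures:

```python
import math

def log_likelihood_ratio(decision_values, chosen, beta_t, beta_s):
    """Sum over decisions of log P(choice | beta_t) - log P(choice | beta_s),
    where each choice is sampled with probability proportional to exp(beta * value)."""
    def log_p(values, idx, beta):
        log_z = math.log(sum(math.exp(beta * v) for v in values))
        return beta * values[idx] - log_z
    return sum(log_p(values, idx, beta_t) - log_p(values, idx, beta_s)
               for values, idx in zip(decision_values, chosen))

# On average this total is positive when choices were sampled at beta_t (pre-training)
# and negative when they were sampled at beta_s (regular training).
```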
My earlier comment about this mistakenly used β1 and β2 in place of βt and βs, which may have been confusing. I’ll go fix that to be consistent with your notation.
I haven’t actually thought much about particular training algorithms yet. I think I’m working at a higher level of abstraction than that at the moment, since my maths doesn’t depend on any specifics of V’s behaviour. I do expect that, in practice, an already-scheming V would have partial success escaping finite-time situations like this with a reasonable difference between the betas.
I’m also imagining that during training, V is made up of different circuits which might be reinforced or weakened.
My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the same way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures in which the expected importance of a scheming circuit in V (weakly) decreases over time, rather than being reinforced.
If we did that for the entire training process, we would not expect to end up with a scheming V.
The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there’s a >50% chance it’s already factoring into the SOTA “alignment” techniques used by labs.