Yeah, of course the notion of “approximation error” matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with βt until V is approximately coherent. That’s the pre-training. And then you switch to training with βs.[1] The natural notion of approximation error in this case is the one you can easily measure, i.e., the expected distance from coherence of states sampled at temperature βt. This reflects the fact that it’ll be hard for this pre-training step to make individual states obey the coherence constraint when those states are very rarely or never sampled at temperature βt, but easy for frequently-sampled states.
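To make that measurement concrete, here is a minimal toy sketch. Nothing in it comes from your maths: the tabular V, the `target` values standing in for “coherent”, the Boltzmann sampler, and the update rule are all placeholders I made up, chosen only so that states move toward coherence in proportion to how often βt visits them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 50

# Toy placeholders: V is a table of state values, and "distance from
# coherence" at a state is just |V(s) - target(s)| for some fixed target
# standing in for whatever the real coherence condition demands.
V = rng.normal(size=n_states)
target = rng.normal(size=n_states)

def boltzmann(values, beta):
    z = np.exp(beta * (values - values.max()))
    return z / z.sum()

def approx_error(V, beta):
    """Expected distance-from-coherence of states sampled at temperature beta."""
    return float(boltzmann(V, beta) @ np.abs(V - target))

def pretrain_step(V, beta_t, lr=0.5):
    """One toy 'RL update': each state moves toward coherence in proportion
    to how often it is visited at temperature beta_t, so rarely-sampled
    states barely move at all."""
    p = boltzmann(V, beta_t)
    return V + lr * (p / p.max()) * (target - V)

beta_t, tol = 3.0, 1e-2
for _ in range(100_000):                 # the "pre-training" phase
    if approx_error(V, beta_t) < tol:
        break                            # ...at which point you'd switch to training with beta_s
    V = pretrain_step(V, beta_t)

p_t = boltzmann(V, beta_t)
rare = p_t < np.median(p_t)
print("measured error at beta_t:      ", approx_error(V, beta_t))
print("worst residual on rare states: ", float(np.abs(V - target)[rare].max()))
```

The gap between the two printed numbers is the point: the quantity the pre-training step optimises and measures only constrains the states βt actually visits, while rarely-visited states can keep residuals orders of magnitude larger.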
My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training whenever it encounters a state whose probability differs significantly between sampling at βt and sampling at βs. A scheming V can exploit this to self-preserve. This violates the assumption of βt-coherence, but in a very plausible-seeming way.
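As a toy illustration of how strong that signal can be (the values and temperatures below are made up, not anything from your setup):

```python
import numpy as np

def boltzmann(values, beta):
    z = np.exp(beta * (values - values.max()))
    return z / z.sum()

# Made-up values over 5 states, and made-up temperatures.
values = np.array([2.0, 1.5, 1.0, 0.0, -1.0])
beta_t, beta_s = 0.5, 3.0

p_t = boltzmann(values, beta_t)   # pre-training sampling distribution
p_s = boltzmann(values, beta_s)   # regular-training sampling distribution

# Log-likelihood ratio per state: how much evidence visiting that state
# gives about which phase we are in.
for s, (pt, ps) in enumerate(zip(p_t, p_s)):
    print(f"state {s}: p_t={pt:.3f}  p_s={ps:.3f}  log(p_s/p_t)={np.log(ps/pt):+.2f}")
```

Visiting state 4 above is roughly 800:1 evidence that sampling is happening at βt rather than βs, so a scheming V only needs to encounter a handful of such states to be confident which phase it is in.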
My earlier comment about this mistakenly used β1 and β2 in place of βt and βs, which may have been confusing. I’ll go fix that to be consistent with your notation.
I haven’t actually thought much about particular training algorithms yet. I’m working at a higher level of abstraction than that at the moment, since my maths doesn’t depend on any specifics of V’s behaviour. I do expect that in practice an already-scheming V would be able to at least partially escape finite-time situations like this where the beta difference is reasonable.
I’m also imagining that during training, V is made up of different circuits which might be reinforced or weakened.
My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures in which the expected importance of a scheming circuit in V (weakly) decreases over time, rather than being reinforced.
If we do that for the entire training process, we would not expect to end up with a scheming V.
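If that property holds, one crude empirical handle on it would be to track an importance proxy for a suspect circuit across checkpoints and check that the trend is weakly downward. This sketch is not from the maths: the numbers and the “importance” proxy (e.g. an ablation delta) are entirely made up.

```python
import numpy as np

# Hypothetical per-checkpoint importance estimates for one suspect circuit,
# e.g. the drop in expected reward when that circuit is ablated. The numbers
# are invented purely to show the check.
importances = np.array([0.31, 0.29, 0.30, 0.26, 0.24, 0.21, 0.20])
steps = np.arange(len(importances))

# The property the training procedure is supposed to give us: the trend is
# (weakly) downward, i.e. the circuit is being weakened rather than reinforced.
slope = np.polyfit(steps, importances, 1)[0]
print(f"importance trend per checkpoint: {slope:+.3f}")
assert slope <= 0.0, "suspect circuit appears to be getting reinforced"
```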
The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there’s a >50% chance it already factors into the SOTA “alignment” techniques used by labs.