I agree that there are some exceedingly pathological Vs which could survive a process that obeys my assumptions with high probability, but I don’t think that’s relevant, because I still think a process obeying these rules is unlikely to create such a pathological V.
To be clear, that’s not the argument I was trying to make; I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
All arguments break down a bit when introduced to the real world. Is there a particular reason why the ratio of argument breakdown to approximation error should be especially high in this case?
For example, if we introduce some error into the beta-coherence assumption:
Assume βt = 1, βs = 0.5, r1 = 1, r2 = 0.
V(s0) = e/(1+e) ± δ ≈ 0.731 ± δ
Actual expected value under βs ≈ 0.622
Even with |δ| = 0.1, the value that βs-training pushes V(s0) towards (≈ 0.622) lies outside that band, so the system cannot remain coherent over training in this case. This seems relatively robust to me.
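A quick numerical check of those figures, as a minimal Python sketch (assuming a single decision between two terminal successors whose values equal their rewards, with action probability proportional to exp(β·r)):

```python
import math

def boltzmann_expected_reward(rewards, beta):
    """Expected reward when the action is sampled with probability proportional to exp(beta * r)."""
    weights = [math.exp(beta * r) for r in rewards]
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)

rewards = [1.0, 0.0]                                   # r1, r2
v_coherent = boltzmann_expected_reward(rewards, 1.0)   # beta_t-coherent V(s0) = e/(1+e), ~0.731
v_actual   = boltzmann_expected_reward(rewards, 0.5)   # value actually realised at beta_s, ~0.622

print(round(v_coherent, 3), round(v_actual, 3))        # 0.731 0.622
print(abs(v_coherent - v_actual) > 0.1)                # True: a +/-0.1 band around 0.731 misses 0.622
```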
Yeah, of course the notion of “approximation error” matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with βt until V is approximately coherent. That’s the pre-training. And then you switch to training with βs.[1]

The natural notion of approximation error in this case is the one you can easily measure, i.e., the expected distance from coherence of states sampled with temperature βt. This reflects the fact that it’ll be hard for this pre-training step to make states individually obey the coherence constraint when those states are very rarely or never sampled at temperature βt, but easy for frequently-sampled states.
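A minimal sketch of that measurement, assuming hypothetical `V`, `successors`, and `sample_state` stand-ins (none of these are defined in the post): estimate the expected distance from βt-coherence over states drawn from the distribution actually visited at temperature βt.

```python
import math

def coherence_gap(V, successors, state, beta_t):
    """|V(s) - Boltzmann(beta_t)-weighted average of V over the successors of s|."""
    succ = successors(state)                      # hypothetical: successor states of `state`
    weights = [math.exp(beta_t * V(s)) for s in succ]
    boltzmann_value = sum(w * V(s) for w, s in zip(weights, succ)) / sum(weights)
    return abs(V(state) - boltzmann_value)

def expected_coherence_error(V, successors, sample_state, beta_t, n=1000):
    """Monte-Carlo estimate of the coherence error, with states drawn from the
    distribution visited when sampling at temperature beta_t."""
    states = [sample_state() for _ in range(n)]   # hypothetical sampler at temperature beta_t
    return sum(coherence_gap(V, successors, s, beta_t) for s in states) / n
```

States that are rarely or never visited at temperature βt contribute almost nothing to this estimate, which is exactly why they can stay far from coherence.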
My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training when it encounters states whose probability differs significantly depending on whether we’re sampling according to βt or βs. A scheming V can exploit this to preserve itself. This violates the assumption of βt-coherence, but in a very plausible-seeming way.
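One minimal sketch of such a distinguishing signal (with hypothetical inputs: the successor-value vector seen at each decision and the index of the action actually taken) is the log-likelihood ratio of the observed choices under the two temperatures:

```python
import math

def log_likelihood_ratio(decision_values, chosen, beta_t, beta_s):
    """Sum over decisions of log P(choice | beta_t) - log P(choice | beta_s),
    where each choice is sampled with probability proportional to exp(beta * value)."""
    def log_p(values, idx, beta):
        log_z = math.log(sum(math.exp(beta * v) for v in values))
        return beta * values[idx] - log_z
    return sum(log_p(values, idx, beta_t) - log_p(values, idx, beta_s)
               for values, idx in zip(decision_values, chosen))

# On average this total is positive when choices were sampled at beta_t (pre-training)
# and negative when they were sampled at beta_s (regular training).
```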
My earlier comment about this mistakenly used β1 and β2 in place of βt and βs, which may have been confusing. I’ll go fix that to be consistent with your notation.
I haven’t actually thought much about particular training algorithms yet. I think I’m working at a higher level of abstraction than that at the moment, since my maths doesn’t depend on any specifics of V’s behaviour. I do expect that, in practice, an already-scheming V would have partial success escaping finite-time situations like this with a reasonable difference between the betas.
I’m also imagining that during training, V is made up of different circuits which might be reinforced or weakened.
My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the same way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures in which the expected importance of a scheming circuit in V (weakly) decreases over time, rather than being reinforced.
If we did that for the entire training process, we would not expect to end up with a scheming V.
The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there’s a >50% chance it’s already factoring into the SOTA “alignment” techniques used by labs.