If s is terminal then [...] we just have V(s)=r(s).
If the probability of eventually encountering a terminal state is 1, then beta-coherence alone is inconsistent with deceptive misalignment, right? That’s because we can determine V exactly from the reward function and the oracle, via backwards induction. (I haven’t revisited RL convergence theorems in a while, so I suspect I’m not stating this quite right.) I mean, it is still consistent in the case where r is indifferent to the states encountered during training but wants some things in deployment (i.e., r is inherently consistent with the provided definition of “deceptively misaligned”). However, it would be inconsistent for reward functions r that are not like that.
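To illustrate the backwards-induction point, here is a minimal sketch. It assumes a finite DAG of states with a `children` map standing in for the oracle O, and it takes beta-coherence to mean that V(s) is the softmax-beta-weighted average of successor values, with V(s) = r(s) at terminals (the post's exact condition may weight actions rather than successor states; the function and state names here are just for illustration). Under those assumptions, every non-terminal value is pinned down by the terminal rewards:

```python
import numpy as np

def coherent_values(children, r, beta):
    """Backward induction: V(s) = r(s) at terminal states, otherwise the
    softmax(beta)-weighted average of successor values."""
    V = {}

    def value(s):
        if s in V:
            return V[s]
        succ = children.get(s, [])
        if not succ:                      # terminal state: V(s) = r(s)
            V[s] = r[s]
        else:
            vals = np.array([value(c) for c in succ])
            weights = np.exp(beta * vals)
            weights /= weights.sum()
            V[s] = float(weights @ vals)  # beta-weighted successor value
        return V[s]

    for s in children:
        value(s)
    return V

# Toy example: one decision point with two terminal outcomes.
children = {"s0": ["good", "bad"], "good": [], "bad": []}
r = {"good": 1.0, "bad": 0.0}
print(coherent_values(children, r, beta=2.0))
# {'good': 1.0, 'bad': 0.0, 's0': 0.88...}
```

With beta = 2.0 the toy example gives V(s0) ≈ 0.88, strictly between r(bad) and r(good); under these assumptions there are no remaining degrees of freedom for V to encode goals beyond what r specifies at terminal states.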
In other words: you cannot have inner-alignment problems if the outer objective is perfectly imposed. You can only have inner-alignment problems if there are important cases which your training procedure wasn’t able to check (e.g., due to distributional shift, or scarcity of data). Perfect beta-coherence combined with a perfect oracle O rules this out.
I’m only referring to the reward constraint being satisfied for scenarios that are in the training distribution, since this maths is entirely applied to a decision taking place in training. Therefore I don’t think distributional shift applies.
Ah yep, that’s a good clarification.