Is that rolling up two things into one, or is that just beta-coherence?

I think you’re right: correctness and beta-coherence can be rolled up into one specific property. I wrote down correctness as a constraint first and then tried to add coherence, but the specific property is that:
If s is terminal then [...] we just have V(s)=r(s).
Which captures both. I will edit the post to clarify this when I get time.
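For concreteness, one way the rolled-up property might be written out is below. This is only a sketch: it assumes deterministic transitions given by the oracle O, no reward collected at non-terminal states, and that beta-coherence means the value of a non-terminal state is the Boltzmann-β-weighted average of its successors’ values; the post’s exact formulation may differ.

$$
V(s) \;=\;
\begin{cases}
r(s) & \text{if } s \text{ is terminal,}\\[4pt]
\sum_{a} \dfrac{e^{\beta V(O(s,a))}}{\sum_{a'} e^{\beta V(O(s,a'))}} \, V(O(s,a)) & \text{otherwise,}
\end{cases}
$$

where O(s, a) denotes the successor state the oracle assigns to taking action a in state s.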
If the probability of eventually encountering a terminal state is 1, then beta-coherence alone is inconsistent with deceptive misalignment, right? That’s because we can determine V exactly from the reward function and the oracle, via backwards induction. (I haven’t revisited RL convergence theorems in a while, so I suspect I’m not stating this quite right.) It is still consistent in the case where r is indifferent to the states encountered during training but wants some things in deployment (i.e., r is inherently consistent with the provided definition of “deceptively misaligned”). However, it would be inconsistent for an r that is not like that.
In other words: you cannot have inner-alignment problems if the outer objective is perfectly imposed. You can only have inner-alignment problems if there are important cases that your training procedure wasn’t able to check (e.g., due to distributional shift or scarcity of data). Perfect beta-coherence combined with a perfect oracle O rules this out.
I’m only referring to the reward constraint being satisfied for scenarios in the training distribution, since the maths here is entirely about a decision taking place during training. Therefore I don’t think distributional shift applies.
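To make the backwards-induction claim above concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than the post’s construction: a finite, acyclic MDP, deterministic transitions given by the oracle, reward attached to states via r(s), no reward term at non-terminal states, and beta-coherence taken to mean that a non-terminal state’s value is the Boltzmann-β-weighted average of its successors’ values.

```python
import math


def boltzmann_weights(values, beta):
    """Softmax weights with inverse temperature beta (numerically stabilised)."""
    m = max(values)
    ws = [math.exp(beta * (v - m)) for v in values]
    z = sum(ws)
    return [w / z for w in ws]


def backward_induction_value(states, actions, oracle, r, is_terminal, beta):
    """Compute the unique beta-coherent V from r and the oracle.

    `states` must be topologically ordered (every state listed before its
    successors), which is possible because every trajectory is assumed to
    reach a terminal state. `oracle(s, a)` returns the successor of s under a.
    """
    V = {}
    for s in reversed(states):  # fill in successors before their predecessors
        if is_terminal(s):
            V[s] = r(s)  # correctness at terminal states: V(s) = r(s)
        else:
            succ_values = [V[oracle(s, a)] for a in actions]
            weights = boltzmann_weights(succ_values, beta)  # policy induced by V
            V[s] = sum(w * v for w, v in zip(weights, succ_values))
    return V


# Tiny worked example: one decision state s0 with two terminal successors.
states = ["s0", "s1", "s2"]
actions = ["left", "right"]
oracle = lambda s, a: {"s0": {"left": "s1", "right": "s2"}}[s][a]
r = lambda s: {"s0": 0.0, "s1": 1.0, "s2": 0.0}[s]

V = backward_induction_value(states, actions, oracle, r,
                             is_terminal=lambda s: s != "s0", beta=1.0)
print(V)  # V is pinned down entirely by r and the oracle
```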