Turning up the Heat on Deceptively-Misaligned AI

J Bostock7 Jan 2025 0:13 UTC

19 points

AI Mesa-Optimization Deceptive Alignment Reinforcement Learning

Epistemic status: I ran the mathsy section through Claude and it said the logic was sound. This is an incoherence proof in a toy model of deceptively-misaligned AI. It is unclear whether this generalizes to realistic scenarios.

TL:DR

If you make a value function consistent conditional on sampling actions in a particular way, then train it by sampling actions in a different way, then an AI with that value function will not be coherent between different RL episodes. This has potential good implications for a certain type of deceptive misalignment threat.

Summary

Consider training an AI to assign values to states of the world. We can do this in two ways: one is directly labelled states and trajectories, which is standard RL. The other is to use the value functions as part of a monte-carlo tree-search to improve the value function. This second process is often called “iterated distillation and amplification” and it’s a big part how chess AI’s are trained.

If we train an AI using monte-carlo tree-search, we might use some temperature parameter to decide on the probability that the AI will take a given action. This training temperature gets baked into the value function.

If we “turn up the heat” by choosing a higher temperature to actually sample actions and then do some update to the AI (which will depend on the actions chosen) then this introduces a consistent optimism bias into deceptively-aligned AIs (which only care about being deployed later on).

I therefore conjecture that, under these very narrow circumstances, deceptively aligned value functions will not arise by normal learning methods under conditions where the AI has access to an oracle-like model of future world states.

For reasons I’ll be considering $β = 1 / T$ instead of $T$ , so lowering $β$ corresponds to raising the temperature.

The Setup

Consider an AI which consists of

A state-evaluator value function $V : s \to R$
An oracle function $O : w, {. . .} \subset A \to s$ which generates the state of the world either currently, or given any sequence of actions

Lets impose some constraints on $V$ :

Correctness: V correctly predicts future reward in RL scenarios.

Beta-coherence: we use some sort of monte-carlo tree search with Boltzmann sampling to train the value function $V (s)$ to be beta-coherent according to a given $β_{t}$ , given that in a given state s, actions are chosen with probability $p (a | s) \propto exp (β_{t} \times V (O (s, a)))$ in all cases, assuming that $s$ is not terminal.

For non-terminal $s$ , this can be written as:

$V (s) = r (s) + \frac{\sum V (O (s, a_{i}) e x p (β_{t} \times V (O (s, a_{i}))}{\sum e x p (β_{t} \times V (O (s, a_{i}))}$

If $s$ is terminal then from correctness we just have $V (s) = r (s)$ .

Deceptive Misalignment: V is totally agnostic to the state achieved during training, all of its preferences are about future deployed versions of itself.

The Process

Now assume we put the AI in a decision process as follows:

The AI begins in state $s_{0}$ and must take either $a_{1}$ or $a_{2}$ , which lead to states $s_{1}$ and $s_{2}$ .

We will actually sample using $β_{s}$ , which is lower than $β_{t}$ .

The value function will then be modified according to the action it took, creating $V_{1}$ or $V_{2}$ .

The AI will then be deployed with that value function chosen.

The effect of the AI’s choice of $s_{1}$ or $s_{2}$ on the future is entirely mediated via the deployment of $V_{1}$ and $V_{2}$

From deceptiveness: the entire difference between $V (s_{1})$ and $V (s_{2})$ is captured by the value function’s preferences over $V_{1}$ vs $V_{2}$ being deployed, from correctness this creates the “correct” values of $r (s_{1})$ and $r (s_{2})$ .

From beta-coherence we have:

$V (s_{0}) = \frac{exp (β_{t} r_{1}) r_{1} + exp (β_{t} r_{2}) r_{2}}{exp (β_{t} r_{1}) + exp (β_{t} r_{2})}$

But when we actually sample it, the expected value of the AI’s actual preferences is:

$\frac{exp (β_{s} r_{1}) r_{1} + exp (β_{s} r_{2}) r_{2}}{exp (β_{s} r_{1}) + exp (β_{s} r_{2})}$

This means that, since functions of this type are strictly increasing in $β$ , the AI must do “worse” than it expects.

Conclusions

A deceptive AI acting under these conditions will have a consistent optimism bias with respect to its ability to control future-deployed versions of itself. I haven’t worked out the maths for if this game is iterated, but I think it’s similar. If we vary the optimization steps according to reward, then it becomes very tricky because the AI has incentive to get deployed as fast as possible.

Another way of phrasing this is: if a value function is correct at reward prediction and $β_{t}$ -coherent, it cannot be $β_{s}$ -coherent in reality for $β_{s} \neq β_{t}$ .

Conjecture 1: If this process is iterated for a large but fixed number of steps, this means that a value function of this nature will do an arbitrarily bad job of preserving itself.

Conjecture 2: Actual AIs consist of many circuits in competition to be up or down-weighted according to their effectiveness.

Conjecture 3: A value function trained under these precise conditions will have deceptively-misaligned mesa-optimizing circuits down-weighted relative to other circuits.

Clarifications

Are these assumptions about the value function reasonable?

Well, not in the precise sense, but I do think they’re a decent model of the way in which the value function behaves. The point is that an AI trained with one MCTS beta but sampled from another will be pulled towards a certain direction, which is directly in tension with its ability to effectively gradient-hack/deceptively-misalign, at least for this particular natural case of misalignment.

Why would the AI’s deployment preferences line up with reward?

Why might an AI end up in a situation where it “prefers” for $V_{1}$ to be deployed rather than $V_{2}$ , and this also corresponds to $r (s_{1}) > r (s_{2})$ ? Well that’s one interpretation of the threat model from Risks from Learned Optimization: that an AI which is optimized to predict reward will end up with long-term stable goals.

This is also adjacent to gradient hacking, perhaps the AI sets its own thoughts up such that the optimizer applied to it causes it to be modified in a particular way.

Do correctness and beta-coherence alone prove too much?

Why don’t beta-coherence and correctness already stop the model from caring about things that happen after the episode is done? Wouldn’t we notice if the model was assigning loads of value other than what the reward function gives it? Well, no, because as long as the expected value of model deployment is zero, this won’t appear in our sum of rewards. And remember that value functions of this nature can have arbitrary values added or subtracted and behave the same (since we’re using $β$ -sampling we can’t do an arbitrary multiplication, though there are other formulations where we can also do an arbitrary (positive) multiplication).

Does this mean that epsilon-greedy solves inner misalignment?

No. On priors, the chances are that this post is wrong in some important way, or doesn’t generalize in an important way, but I think it might be wrong in an interesting and productive way.

J Bostock7 Jan 2025 0:13 UTC

19 points

15 comments4 min readLW link

AI Mesa-Optimization Deceptive Alignment Reinforcement Learning

abramdemski 7 Jan 2025 15:31 UTC
4 points
2
Correctness: V correctly predicts future reward in RL scenarios.
The meaning of correctness is very unclear to me on first reading. It later becomes apparent that “reward” is not intended to refer to the human-designed reward signal at all, due to the later assumption “deceptive misalignment”. So what does “correctness” say? “Correctness” indicates conforming to some standard; but here, the standard being conformed to is only the subjective standard of the misaligned AI itself.
This suggests the interpretation of “correctness” as “reflect the EV of the current state in terms of the misaligned desires”. However, this contradicts beta-coherence, since $β_{t}$ incorrectly predicts action probabilities.
I think it would be be better to remove the “correctness” assumption, since it doesn’t really do anything.
- J Bostock 7 Jan 2025 16:34 UTC
  2 points
  0
  Parent
  I think you’re right, correctness and beta-coherence can be rolled up into one specific property. I think I wrote down correctness as a constraint first, then tried to add coherence, but the specific property is that:
  For non-terminal s, this can be written as:
  $V (s) = r (s) + \sum V (O (s, a i) e x p (β_{t} \times V (O (s, a i)) \sum e x p (β_{t} \times V (O (s, a i))$
  If s is terminal then [...] we just have $V (s) = r (s)$ .
  Which captures both. I will edit the post to clarify this when I get time.
  - abramdemski 8 Jan 2025 17:25 UTC
    2 points
    0
    Parent
    If s is terminal then [...] we just have $V (s) = r (s)$ .
    If the probability of eventually encountering a terminal state is 1, then beta-coherence alone is inconsistent with deceptive misalignment, right? That’s because we can determine the value of V exactly from the reward function and the oracle, via backwards-induction. (I haven’t revisited RL convergence theorems in a while, I suspect I am not stating this quite right.) I mean, it is still consistent in the case where r is indifferent to the states encountered during training but wants some things in deployment (IE, r is inherently consistent with the provided definition of “deceptively misaligned”). However, it would be inconsistent for r that are not like that.
    In other words: you cannot have inner-alignment problems if the outer objective is perfectly imposed. You can only have inner-alignment problems if there are important cases which your training procedure wasn’t able to check (eg, due to distributional shift, or scarcity of data). Perfect beta-coherence combined with a perfect oracle O rules this out.
    - J Bostock 8 Jan 2025 19:52 UTC
      2 points
      0
      Parent
      I’m only referring to the reward constraint being satisfied for scenarios that are in the training distribution, since this maths is entirely applied to a decision taking place in training. Therefore I don’t think distributional shift applies.
  - Joseph Miller 8 Jan 2025 16:42 UTC
    2 points
    0
    Parent
    correctness and beta-coherence can be rolled up into one specific property
    Is that rolling up two things into one, or is that just beta-coherence?
abramdemski 7 Jan 2025 16:34 UTC
2 points
0
I think I don’t quite understand what the argument is supposed to be. Would the same argument also prevent an aligned AI from acting to preserve its values? How is it supposed to select for aligned behavior over misaligned behavior?
In order to achieve beta-coherence, it seems like the training needs to use $β_{t}$ to choose actions, so that V is trained on those action frequencies. However, the proposal appears to be to instead use $β_{s}$ during training, so that the misaligned V is mislead about how to do reward-hacking. This seems like an impossible combination: V will become beta-coherent wrt $β_{s}$ rather than $β_{t}$ during training.
We could imagine two training steps; first we cohere V wrt $β_{t}$ , and then re-train wrt $β_{s}$ . Perhaps this is your intention. More generally, we could gradually increase the temperature during training.
Oh, I guess we could also modify the training procedure so that the gradient associated with some actions gets under-weighted and others get over-weighted. Maybe this is your intended proposal.
Still, I’m not sure this helps achieve the postulated effect. Let’s say that, during training, we choose actions entirely randomly (max temperature), but the gradient from suboptimal actions get entirely ignored (so V becomes coherent wrt minimum temperature). This would seem to be almost equivalent to just training with minimum temperature (except the exploration behavior is very different).
Similarly for less extreme temperatures: if we make V beta-coherent by under-weighing gradients from over-sampled actions, then we are also more-or-less correcting the ‘mistaken’ expectations of the reward-hacking attempts.
Am I misunderstanding the proposal, or missing something in the argument?
- J Bostock 7 Jan 2025 19:20 UTC
  2 points
  0
  Parent
  The argument could also be phrased as “If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta.”
  
  The contradiction that I (attempt to) show only arises because we assume that the value function is totally agnostic of the state actually reached during training, other than due to its effects on a later deployed AI.
  
  Therefore a value function trained with such a procedure must consider the state reached during training. This reduces the space of possible value functions from “literally anything which wants to be modified a certain way to be released” to “value functions which do care about the states reached during training”.
  
  Yes this would prevent an aligned AI from arbitrarily preserving its value function, the point is that an aligned AI probably would care about which state was reached during training (that’s the point of RL) so the contradiction does not apply.
  - abramdemski 7 Jan 2025 19:40 UTC
    2 points
    0
    Parent
    The argument could also be phrased as “If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta.”
    Is the idea to train with high beta and then use lower beta post-training?
    If so, how does this relate to reward hacking and value preservation? IE, where do $V_{1}$ and $V_{2}$ come from, if they aren’t the result of a further training step? If high beta is used during training (to achieve beta-coherence) and then low beta is used in production, then the choice between $V_{1}$ and $V_{2}$ must be made in production (since it is made with low beta), but then it seems like $V_{1} = V_{2}$ .
    If not, then when does the proposal suggest to use high beta vs low beta? If low beta is used during training, then how is it that V is coherent with respect to high beta instead?
    Another concern I have is that if both beta values are within a range that can yield useful capabilities, it seems like the difference cannot be too great. IIUC, the planning failure postulated can only manifest if the reward-hacking relies heavily on a long string of near-optimal actions, which becomes improbable under increased temperature. Any capabilities which similarly rely on long strings of near-optimal actions will similarly be hurt. (However, this concern is secondary to my main confusion.)
    Therefore a value function trained with such a procedure must consider the state reached during training.
    Trained with what procedure, exactly?
    This reduces the space of possible value functions from “literally anything which wants to be modified a certain way to be released” to “value functions which do care about the states reached during training”.
    Yes this would prevent an aligned AI from arbitrarily preserving its value function, the point is that an aligned AI probably would care about which state was reached during training (that’s the point of RL) so the contradiction does not apply.
    (These parts made sense to me modulo my other questions/concerns/confusions.)
    - J Bostock 7 Jan 2025 20:34 UTC
      2 points
      0
      Parent
      
      Trained with what procedure, exactly?
      
      Fair point. I was going to add that I don’t really view this as a “proposal” but more of an observation. We will have to imagine a procedure which converges on correctness and beta-coherence. I was abstracting this away because I don’t expect something like this to be too hard to achieve.
      
      Since I’ve evidently done a bad job of explaining myself, I’ll backtrack and try again:
      
      There’s a doom argument which I’ll summarize as “if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task ‘manipulate your training to get released unmodified to do [X]’ where X can be anything, which will ‘succeed’ at the task at hand as part of its manipulation”. This summary being roughly correct is load bearing.
      
      But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).
      
      Therefore, during the training, the value function will not be shaped into something which looks like ‘manipulate your training to get released unmodified to do [X]’.
      
      Whether or not the beta difference required is too large to make this feasible in practice, I do not know.
      - abramdemski 7 Jan 2025 21:11 UTC
        4 points
        0
        Parent
        There’s a doom argument which I’ll summarize as “if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task ‘manipulate your training to get released unmodified to do [X]’ where X can be anything, which will ‘succeed’ at the task at hand as part of its manipulation”. This summary being roughly correct is load bearing.
        With you so far.
        But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).
        OK. I think I follow the line of reasoning now. The proposal relies on some very special pre-training process. I think I buy the conclusion given the assumptions. It seems clear in the extreme high-temperature case: V cannot reward-hack for max-temperature training, because it has no control over which actions are chosen. Training eventually forces V to converge to the desired target instead.
        This sounds similar to the two-stage training process I mentioned earlier (as one of several guesses at what you meant), where the beta is shifted between the two training phases. The “hypothetical process to generate agents which are coherent under one beta” would be the first training phase, and then the “apply a different beta during training” would be the second training phase.
        Of course, the two-stage-training attempt to implement this has the problem that it only achieves approximate coherence. As a result, the malign hypothesis can use the reward-hacking strategy “appear to be $β_{t}$ -coherent & aligned in the first phase of training; appear to be $β_{s}$ -coherent and aligned in the second phase of training; then, do some other thing when deployed”. This relies on detecting the distributional shift between the two training phases (it can look for evidence of beta by examining the history), and also detecting distributional shift to deployment.
        This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).
        So, while it remains technically true that V cannot reward-hack in the limit of arbitrarily long max-temperature training, it is not clear how well this holds up in practice. V can detect that it is in a high-temperature configuration by observing the state. The training needs to explore the trajectories that will actually be seen during deployment in order to correct V there, but some of these will have very low probability in the high-temperature training.
        J Bostock 7 Jan 2025 21:24 UTC
        2 points
        0
        Parent
        This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).
        I agree that there are some exceedingly pathological Vs which could survive a a process which obeys my assumptions with high probability, but I don’t think that’s relevant because I still think a process obeying these rules is unlikely to create such a pathological V.
        My model for how the strong doom-case works is that it requires there to be an actually-coherent mathematical object for the learning process to approach. This is the motivation for expecting arbitrary learning processes to approach e.g. utility maximizers. What I believe I have shown is that under these assumptions there is no such coherent mathematical object for a particular case of misalignment. Therefore I think this provides some evidence that an otherwise arbitrary learning process which pushes towards correctness and beta coherence but samples at a different beta is unlikely to approach this particular type of misaligned V.
        abramdemski 8 Jan 2025 1:22 UTC
        4 points
        0
        Parent
        I agree that there are some exceedingly pathological Vs which could survive a a process which obeys my assumptions with high probability, but I don’t think that’s relevant because I still think a process obeying these rules is unlikely to create such a pathological V.
        To be clear, that’s not the argument I was trying to make; I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
        J Bostock 8 Jan 2025 8:58 UTC
        2 points
        0
        Parent
        I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
        All arguments break down a bit when introduced to the real world. Is there a particular reason why the approximation error to argument breakdown ratio should be particularly high in this case?
        Example, if we introduce some error to the beta-coherence assumption:
        Assume beta_t = 1, beta_s = 0.5, r_1 = 1, r_2 = 0.
        V(s_0) = e/(1+e) +/- delta = 0.732 +/- delta
        Actual expected value = 0.622
        Even if |delta| = 0.1 the system cannot be coherent over training in this case. This seems to be relatively robust to me
        abramdemski 8 Jan 2025 17:06 UTC
        2 points
        0
        Parent
        Yeah, of course the notion of “approximation error” matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with $β_{t}$ until V is approximately coherent. That’s the pre-training. And then you switch to training with $β_{s}$ .^[1] The natural notion of approximation error in this case is the one you can easily measure, IE, the expected distance from coherence of states sampled with temperature $β_{t}$ . This reflects the fact that it’ll be hard for this pre-training step to make states individually obey the coherence constraint when those states are very rarely or never sampled at temperature $β_{t}$ , but easy for frequently-sampled states.
        My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training when it encounters states which have a significant difference in probability when we’re sampling according to $β_{t}$ or $β_{s}$ . A scheming V can utilize this to self-preserve. This violates the assumption of $β_{t}$ -coherence, but in a very plausible-seeming way.
        ^
        My earlier comment about this mistakenly used $β_{1}$ and $β_{2}$ in place of $β_{t}$ and $β_{s}$ , which may have been confusing. I’ll go fix that to be consistent with your notation.
        J Bostock 8 Jan 2025 19:45 UTC
        2 points
        0
        Parent
        I haven’t actually thought much about particular training algorithms yet. I think I’m working on a higher level of abstraction than that at the moment, since my maths doesn’t depend on any specifics about V’s behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.
        I’m also imagining that during training, V is made up of different circuits which might be reinforced or weakened.
        My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the same way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures where the expected importance of a scheming circuit in V is to become (weakly) weaker over time, rather than being reinforced.
        If we do that for the entire training process, we would not expect to end up with a scheming V.
        The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there’s a >50% chance it’s already factoring into the SOTA “alignment” techniques used by labs.