It seems to me like you’re positing some “need to do well in training”, which is… a kinda weird frame. In a weak correlational sense, it’s true that loss tends to decrease over training-time and research-time.
No, I don’t think I’m positing that—in fact, I said that the aligned model doesn’t do this.
I don’t understand why you claim to not be doing this. Probably we misunderstand each other? You do seem to be incorporating a “(strong) pressure to do well in training” in your reasoning about what gets trained. You said (emphasis added):
The argument for deceptive alignment is that deceptive alignment might be the easiest way for the model to figure out how to do well in training.
This seems to be engaging in the kind of reasoning I’m critiquing.
We now have two dual optimization problems, “minimize loss subject to some level of inductive biases” and “maximize inductive biases subject to some level of loss” which we can independently investigate to produce evidence about the original joint optimization problem.
Sure, this (at first pass) seems somewhat more reasonable, in terms of ways of thinking about the problem. But I don’t think the vast majority of “loss-minimizing” reasoning actually involves this more principled analysis. Before now, I have never heard anyone talk about this frame, or any other recovery which I find satisfying.
So this feels like a motte-and-bailey, where the strong-and-common claim goes like “we’re selecting models to minimize loss, and so if deceptive models get lower loss, that’s a huge problem; let’s figure out how to not make that be true” and the defensible-but-weak-and-rare claim is “by considering loss minimization given certain biases, we can gain evidence about what kinds of functions SGD tends to train.”
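For concreteness, the two dual problems quoted above can be written out as constrained problems (the notation below is not from the thread itself; L(θ) is the training loss and C(θ) is whatever complexity measure the inductive biases penalize):

\[
\min_\theta \, L(\theta) \ \text{ s.t. } \ C(\theta) \le c
\qquad \text{versus} \qquad
\min_\theta \, C(\theta) \ \text{ s.t. } \ L(\theta) \le \ell
\]

Here “maximize inductive biases” is read as “be maximally favored by the inductive biases”, i.e. minimize C, and each constrained problem can be studied separately as evidence about the original joint problem.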
You do seem to be incorporating a “(strong) pressure to do well in training” in your reasoning about what gets trained.
I mean, certainly there is a strong pressure to do well in training—that’s the whole point of training. What there isn’t strong pressure for is for the model to internally be trying to figure out how to do well in training. The model need not be thinking about training at all to do well on the training objective, e.g. as in the aligned model.
To be clear, here are some things that I think:
The model needs to figure out how to somehow output a distribution that does well in training. Exactly how well relative to the inductive biases is unclear, but generally I think the easiest way to think about this is to take performance at the level you expect of powerful future models as a constraint.
There are many algorithms which result in outputting a distribution that does well in training. Some of those algorithms directly reason about the training process, whereas some do not.
Taking training performance as a constraint, the question is what is the easiest way (from an inductive bias perspective) to produce such a distribution.
Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don’t just get complete memorization (which is highly unlikely under the inductive biases).
Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
Comparing the deceptive to sycophantic models, the primary question is which one is an easier way (from an inductive bias perspective) to compute how to do well on the training process: directly memorizing pointers to that information in the world model, or deducing that information using the world model based on some goal.
I have never heard anyone talk about this frame
I think probably that’s just because you haven’t talked to me much about this. The point about whether to use a loss minimization + inductive bias constraint vs. loss constraint + inductive bias minimization was a big one that I commented a bunch about on Joe’s report. In fact, I suspect he’d probably have some more thoughts here on this—I think he’s not fully sold on my framing above.
So this feels like a motte-and-bailey
I agree that there are some people that might defend different claims than I would, but I don’t think I should be responsible for those claims. Part of why I’m excited about Joe’s report is that it takes a bunch of different isolated thinking from different people and puts it into a single coherent position, so it’s easier to evaluate that position in totality. If you have disagreements with my position, with Joe’s position, or with anyone else’s position, that’s obviously totally fine—but you shouldn’t equate them into one group and say it’s a motte-and-bailey. Different people just think different things.
I mean, certainly there is a strong pressure to do well in training—that’s the whole point of training.
I disagree. This claim seems to be precisely what my original comment was critiquing:
It seems to me like you’re positing some “need to do well in training”, which is… a kinda weird frame. In a weak correlational sense, it’s true that loss tends to decrease over training-time and research-time.
But I disagree with an assumption of “in the future, the trained model will need to achieve loss scores so perfectly low that the model has to e.g. ‘try to do well’ or reason about its own training process in order to get a low enough loss, otherwise the model gets ‘selected against’.”
And then you wrote, as some things you believe:[1]
The model needs to figure out how to somehow output a distribution that does well in training...
Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don’t just get complete memorization (which is highly unlikely under the inductive biases)...
Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
This is the kind of claim I was critiquing in my original comment!
but you shouldn’t equate them into one group and say it’s a motte-and-bailey. Different people just think different things.
My model of you makes both claims, and I indeed see signs of both kinds of claims in your comments here.

Thanks for writing out a bunch of things you believe, by the way! That was helpful.
I am now very confused about what you believe. Obviously training selects for low-loss algorithms… that’s the whole point of training? I thought you were saying that training doesn’t select for algorithms that internally optimize for loss, which is true, but it definitely does select for algorithms that in fact get low loss.
The point of training in a practical sense is generally to produce networks with desirable behavior. The point of training in a dynamical sense is to perform an optimizer-mediated update to locally reduce loss along the locally steepest direction, aggregating gradients over different subsets of the data.
What is the empirical content of the claim that “training selects for low loss algorithms”? Can you make it more precise, perhaps by tabooing “selects for”?
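To make the “dynamical sense” above concrete, here is a minimal sketch of a standard training loop (illustrative PyTorch, not code from this thread); the mechanism is just a repeated, local, optimizer-mediated parameter update computed from the gradient of the loss on each minibatch:

```python
# Minimal sketch of training in the "dynamical sense": repeated local updates
# that reduce the loss on each minibatch along the gradient direction.
import torch

model = torch.nn.Linear(10, 1)                       # stand-in network
opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # optimizer mediating the update
loss_fn = torch.nn.MSELoss()

for _ in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # one subset (minibatch) of the data
    opt.zero_grad()
    loss = loss_fn(model(x), y)                      # loss on this minibatch
    loss.backward()                                  # gradient of that loss w.r.t. parameters
    opt.step()                                       # local step along the steepest-descent direction
```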
I’ll register a prediction here: I think TurnTrout is trying to say that while, counterfactually, an algorithm that reasons about training would achieve low loss, it’s not obviously true that such algorithms are actually “achievable” for SGD in some “natural” setting.

That’s what I thought he was saying previously, but he objected to that characterization in his most recent comment.
Wait, where? I think the objection to “Doing that is quite hard” is not an objection to “it’s not obviously true that such algorithms are actually ‘achievable’ for SGD”—it’s an objection to concluding, from the weak statement that loss decreases during training, that the model would try hard enough to justify arguments about deception.
an objection to concluding, from the weak statement that loss decreases during training, that the model would try hard enough to justify arguments about deception.

This is… roughly one point I was making, yes.