I mean, certainly there is a strong pressure to do well in training—that’s the whole point of training.
I disagree. This claim seems to be precisely what my original comment was critiquing:
It seems to me like you’re positing some “need to do well in training”, which is… a kinda weird frame. In a weak correlational sense, it’s true that loss tends to decrease over training-time and research-time.
But I disagree with an assumption of “in the future, the trained model will need to achieve loss scores so perfectly low that the model has to e.g. ‘try to do well’ or reason about its own training process in order to get a low enough loss, otherwise the model gets ‘selected against’.”
And then you wrote, as some things you believe:[1]
The model needs to figure out how to somehow output a distribution that does well in training...
Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don’t just get complete memorization (which is highly unlikely under the inductive biases)...
Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
This is the kind of claim I was critiquing in my original comment!
but you shouldn’t equate them into one group and say it’s a motte-and-bailey. Different people just think different things.
My model of you makes both claims, and I indeed see signs of both kinds of claims in your comments here.

Thanks for writing out a bunch of things you believe, by the way! That was helpful.
I am now very confused about what you believe. Obviously training selects for low loss algorithms… that’s the whole point of training? I thought you were saying that training doesn’t select for algorithms that internally optimize for loss, which is true, but it definitely does select for algorithms that in fact get low loss.
The point of training in a practical sense is generally to produce networks with desirable behavior. The point of training in a dynamical sense is to perform an optimizer-mediated update to locally reduce loss along the locally steepest direction, aggregating gradients over different subsets of the data.
What is the empirical content of the claim that “training selects for low loss algorithms”? Can you make it more precise, perhaps by tabooing “selects for”?
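To make the “dynamical sense” concrete, here is a minimal sketch of a single update step, written as a generic PyTorch-style training loop. The model, optimizer, and loss function below are illustrative stand-ins, not anything referenced in the thread:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins; nothing here comes from the discussion above.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def training_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One optimizer-mediated update: locally reduce the loss on this minibatch."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # loss on this particular minibatch only
    loss.backward()               # gradient of that minibatch loss w.r.t. the parameters
    optimizer.step()              # step along the negative gradient direction; nothing
                                  # here requires the network to "try to do well" or to
                                  # reason about its own training process
    return loss.item()

# Usage: training_step(torch.randn(32, 10), torch.randn(32, 1))
```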
I’ll register a prediction here that TurnTrout is trying to say that while, counterfactually, if we had an algorithm that reasons about training, it would achieve low loss, it’s not obviously true that such algorithms are actually “achievable” for SGD in some “natural” setting.

That’s what I thought he was saying previously, but he objected to that characterization in his most recent comment.
Wait, where? I think the objection to “Doing that is quite hard” is not an objection to “it’s not obviously true that such algorithms are actually ‘achievable’ for SGD”; it’s an objection to drawing, from the weak statement that loss decreases during training, the conclusion that the model would try hard enough to justify arguments about deception.
This is… roughly one point I was making, yes.