Reposting my response from Twitter. (To clarify, the following was originally written as a Tweet in response to Joe Carlsmith’s Tweet about the paper; I am now reposting it here.)
I just skimmed the section headers and a small amount of the content, but I’m extremely skeptical. E.g., the “counting argument” seems incredibly dubious to me because you can just as easily argue that text to image generators will internally create images of llamas in their early layers, which they then delete, before creating the actual asked for image in the later layers. There are many possible llama images, but “just one” network that straightforwardly implements the training objective, after all.
The issue is that this isn’t the correct way to do counting arguments on NN configurations. While there are indeed an exponentially large number of possible llama images that an NN might create internally, there are an even more exponentially large number of NNs that have random first layers, and then go on to do the actual thing in the later layers. Thus, the “inner llamaizers” are actually more rare in NN configuration space than the straightforward NN.
The key issue is that each additional computation you speculate an NN might be doing acts as an additional constraint on the possible parameters, since the NN has to internally contain circuits that implement those computations. The constraint that the circuits actually have to do “something” is a much stronger reduction in the number of possible configurations for those parameters than any additional configurations you can get out of there being multiple “somethings” that the circuits might be doing.
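To make the counting concrete, here is a toy enumeration. It is only a minimal sketch under loud assumptions, not anything from the paper: the "early layers" are 8 binary parameters, and a setting "draws a llama" only if its first four bits match a fixed template. The names `EARLY_BITS`, `LLAMA_TEMPLATE`, and `draws_llama` are hypothetical and chosen purely for illustration.

```python
from itertools import product

EARLY_BITS = 8  # toy stand-in for the early-layer parameters (assumption)

# Hypothetical constraint: the early layers "draw a llama" only if their
# first four bits match this template. The remaining four bits are free,
# so there are 2**4 distinct "llama images".
LLAMA_TEMPLATE = (1, 0, 1, 1)


def draws_llama(bits):
    """Return True if this early-layer setting satisfies the llama constraint."""
    return bits[:4] == LLAMA_TEMPLATE


# Every possible early-layer setting.
all_settings = list(product((0, 1), repeat=EARLY_BITS))

# Straightforward network: the early layers can be anything, because the
# later layers do the actual task.
n_unconstrained = len(all_settings)

# Inner llamaizer: the early layers must satisfy the llama constraint.
n_llamaizers = sum(1 for bits in all_settings if draws_llama(bits))

print(n_unconstrained, n_llamaizers)
```

Under these assumptions there are many distinct llama settings, but still far fewer than the unconstrained settings, because requiring the early bits to do "something" removes more configurations than the variety of "somethings" adds back.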
So in the case of deceptive alignment counting arguments, they seem to be speculating that the NN’s cognition looks something like:
[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)] [figure out how to do well at training] [actually do well at training]
and in comparison, the “honest” / direct solution looks like:
[figure out how to do well at training] [actually do well at training]
and then because there are so many different possibilities for “x”, they say there are more solutions that look like the deceptive cognition. My contention is that the steps “[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)]” in the deceptive cognition are actually unnecessary, and because implementing those steps requires that one have circuits that instantiate those computations, the requirement that the deceptive model perform those steps actually *constrains* the number of parameter configurations that implement the deceptive cognition, which reduces the volume of deceptive models in parameter space.
One obvious counterpoint I expect is to claim that the “[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)]” steps actually do contribute to the later steps, maybe because they’re a short way to compress a motivational pointer to “wanting” to do well on the training objective.
I don’t think this is how NN simplicity biases work. Under the “cognitive executions impose constraints on parameter settings” perspective, you don’t actually save any complexity by supposing that the model has some motive for figuring stuff out internally, because the circuits required to implement the “figure stuff out internally” computations themselves count as additional complexity. In contrast, if you have a view of simplicity that’s closer to program description length, then you’re not counting runtime execution against program complexity, and so a program that has short length in code but long runtime can count as simple.
@jkcarlsmith does seem more sympathetic to the “parameters as complexity” view than to the “code length as complexity” view. However, I go further and think that the “parameters as complexity” view actively points against deceptive alignment.
I also think NNs have an even stronger bias for short paths than you might expect from just the “parameters as complexity” view. Consider a problem that can either be solved with a single circuit of depth n, or by two parallel circuits of depth n/2 (so both solutions must set the values of an equal number of parameters, but one solution is more parallel than the other). I claim there are far more parameter configurations that implement the parallel solution than parameter configurations that implement the serial solution.
This is because the parallel solution has an entire permutation group that’s not available to the serial solution: the two parallel circuits can be moved to different relative depths with respect to each other, whereas all the parts of the serial solution must have fixed relative depths. Thus, the two parallel circuits represent less of a constraint on the possible configurations of the NN, and so there are far more NNs that implement the parallel solution.
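A rough way to see the size of this effect is to count placements in a toy model. The sketch below assumes a circuit of depth d occupies d consecutive layers of a network with `depth` layers, and that the two parallel circuits can be positioned independently. It ignores width limits and how the parallel outputs are combined, so treat the numbers as illustrative only. The function names are hypothetical.

```python
def placements(depth, circuit_depth):
    """Number of starting layers for a circuit spanning consecutive layers."""
    if circuit_depth > depth:
        return 0
    return depth - circuit_depth + 1


def serial_count(depth, n):
    """One circuit of depth n: all parts have fixed relative depths."""
    return placements(depth, n)


def parallel_count(depth, n):
    """Two circuits of depth n // 2, each placed independently (assumption)."""
    half = n // 2
    return placements(depth, half) ** 2


if __name__ == "__main__":
    # Toy numbers: a 12-layer network and a problem needing total depth 6.
    print("serial placements:  ", serial_count(12, 6))
    print("parallel placements:", parallel_count(12, 6))
```

In this toy model the parallel solution has many more placements than the serial one, which is the sense in which it constrains the network's configuration less.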
As a consequence, I expect there are significant “short depth” biases in the NN simplicity prior, consistent with empirical results such as: https://arxiv.org/abs/1605.06431
Finally, I’m extremely skeptical of claims that NNs contain a ‘ghost of generalized instrumental reasoning’, able to perform very competent long term hidden scheming and deduce lots of world knowledge “in-context”. I think current empirical results point strongly against that being plausible.
For example, consider the “reversal curse” results (training on “A is B” doesn’t lead to models learning “B is A”). If the ghost can’t even infer from “A is B” to “B is A”, then I think stuff like inferring from “I have a goal x”, to “here is the specific task I must perform in order to maximize my reward” is pretty much out of the question. Thus, stories about how SGD might use arbitrary goals as a way to efficiently compress an (effective) desire for the NN to silently infer lots of very specific details about the training process seem incredibly implausible to me.
I expect objections of the form “I expect future training processes to not suffer from the reversal curse, and I’m worried about the future training processes.”
Obviously people will come up with training processes that don’t suffer from the reversal curse. However, comparing the simplicity of the reversal curse to the capability of current NNs is still evidence about the relative power of the ‘instrumental ghost’ in the model compared to the external capabilities of the model. If a similar ratio continues to hold for externally superintelligent AIs, then that casts enormous doubt on e.g., deceptive alignment scenarios where the model is internally and instrumentally deriving huge amounts of user-goal-related knowledge so that it can pursue its arbitrary mesaobjectives later down the line. I’m using the reversal curse to make a more generalized update about the types of internal cognition that are easy to learn and how they contribute to external capabilities.
Some other Tweets I wrote as part of the discussion:
The key points of my Tweet are basically “the better way to think about counting arguments is to compare constraints on parameter configurations”, and “corrected counting arguments introduce an implicit bias towards short, parallel solutions”, where both “counting the constrained parameters”, and “counting the permutations of those parameters” point in that direction.
I think shallow depth priors are pretty universal. E.g., they also make sense from a perspective of “any given step of reasoning could fail, so best to make as few sequential steps as possible, since each step is rolling the dice”, as well as a perspective of “we want to explore as many hypotheses as possible with as little compute as possible, so best have lots of cheap hypotheses”.
I’m not concerned about the training for goal achievement contributing to deceptive alignment, because such training processes ultimately come down to optimizing the model to imitate some mapping from “goal given by the training process” → “externally visible action sequence”. Feedback is always upweighting cognitive patterns that produce some externally visible action patterns (usually over short time horizons).
In contrast, it seems very hard to me to accidentally provide sufficient feedback to specify long-term goals that don’t distinguish themselves from short-term ones over short time horizons, given the common understanding in RL that credit assignment difficulties actively work against the formation of long-term goals. It seems more likely to me that we’ll instill long-term goals into AIs by “scaffolding” them via feedback over shorter time horizons. E.g., train GPT-N to generate text like “the company’s stock must go up” (short time horizon feedback), as well as text that represents GPT-N competently responding to a variety of situations and discussions about how to achieve long-term goals (more short time horizon feedback), and then put GPT-N in a continuous loop of sampling from a combination of the behavioral patterns thereby constructed, in such a way that the overall effect is competent long-term planning.
The point is: long term goals are sufficiently hard to form deliberately that I don’t think they’ll form accidentally.
...I think the llama analogy is exactly correct. It’s specifically designed to avoid triggering mechanistically ungrounded intuitions about “goals” and “tryingness”, which I think inappropriately upweight the compellingness of a conclusion that’s frankly ridiculous on the arguments themselves. Mechanistically, generating the intermediate llamas is just as causally upstream of generating the asked for images, as “having an inner goal” is causally upstream of the deceptive model doing well on the training objective. Calling one type of causal influence “trying” and the other not is an arbitrary distinction.
My point about the “instrumental ghost” wasn’t that NNs wouldn’t learn instrumental / flexible reasoning. It was that such capabilities were much more likely to derive from being straightforwardly trained to learn such capabilities, and then to be employed in a manner consistent with the target function of the training process. What I’m arguing *against* is the perspective that NNs will “accidentally” acquire such capabilities internally as a convergent result of their inductive biases, and direct them to purposes/along directions very different from what’s represented in the training data. That’s the sort of stuff I was talking about when I mentioned the “ghost”.
What I’m saying is there’s a difference between a model that can do flexible instrumental reasoning because it’s faithfully modeling a data distribution with examples of flexible instrumental reasoning, versus a model that acquired hidden flexible instrumental reasoning because NN inductive biases say the convergent best way to do well on tasks is to acquire hidden flexible instrumental reasoning and apply it to the task, even when the task itself doesn’t have any examples of such.
I really don’t like all this discussion happening on Twitter, and I appreciate that you took the time to move this back to LW/AF instead. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.
Regardless, some quick thoughts:
[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)] [figure out how to do well at training] [actually do well at training]
and in comparison, the “honest” / direct solution looks like:
[figure out how to do well at training] [actually do well at training]
I think this is a mischaracterization of the argument. The argument for deceptive alignment is that deceptive alignment might be the easiest way for the model to figure out how to do well in training. So a more accurate comparison would be:
Deceptive model:
[figure out how to do well at training] [actually do well at training]
Sycophantic model:
[figure out how to do well at training] [actually do well at training]
Aligned model:
[figure out how to be aligned] [actually be aligned]
Notably, the deceptive and sycophantic models are the same! But the difference is that they look different when we break apart the “figure out how to do well at training” part. We could do the same breakdown for the sycophantic model, which might look something like:
Sycophantic model:
[load in some hard-coded specification of what it means to do well in training] [figure out how to execute on that specification in this environment] [actually do well at training]
The problem is that figuring out how to do well at training is actually quite hard, and deceptive alignment might make that problem easier by reducing it to the (potentially) simpler/easier problem of figuring out how to accomplish <insert any long-term goal here>. Whereas the sycophantic model just has to memorize a bunch of stuff about training that the deceptive model doesn’t have to.
The point is that you can’t just say “well, deceptive alignment results in the model trying to do well in training, so why not just learn a model that starts by trying to do well in training” for the same reason that you can’t just say “well, deceptive alignment results in the model outputting this specific distribution, so why not just learn a model that memorizes that exact distribution”. The entire question is about what the easiest way is to produce that distribution in terms of the inductive biases.
Also, another point that I’d note here: the sycophantic model isn’t actually desirable either! So long as the deceptive model beats the aligned model in terms of the inductive biases, it’s still a concern, regardless of whether it beats the sycophantic model or not. I’m pretty unsure which is more likely between the deceptive and sycophantic models, but I think both pretty likely beat the aligned model in most cases that we care about. But I’m more optimistic that we can find ways to address sycophantic models than deceptive models, such that I think the deceptive models are more of a concern.
The problem is that figuring out how to do well at training is actually quite hard[1]
It seems to me like you’re positing some “need to do well in training”, which is… a kinda weird frame. In a weak correlational sense, it’s true that loss tends to decrease over training-time and research-time.
But I disagree with an assumption of “in the future, the trained model will need to achieve loss scores so perfectly low that the model has to e.g. ‘try to do well’ or reason about its own training process in order to get a low enough loss, otherwise the model gets ‘selected against’.” (It seems to me that you are making this assumption; let me know if you are not.) (ETA: Evan doesn’t think that it’s necessary to do this in order to be selected by training. In particular, he wrote that ‘aligned models’ won’t need to do this.)
I don’t know why so many people seem to think model training works like this. That is, that one can:
1. verbally reason about whether a postulated algorithm “minimizes loss” (example algorithm: using a bunch of interwoven heuristics to predict text), and then
2. brainstorm English descriptions of algorithms which, according to us, get even lower loss (example algorithm: reason out the whole situation every forward pass, and figure out what would minimize loss on the next token), and then
3. since algorithm-2 gets “lower loss” (as reasoned about informally in English), one is licensed to conclude that SGD is incentivized to pick the latter algorithm.
I think that loss just does not constrain training that tightly, or in that fashion.[2] I can give a range of counterevidence, from early stopping (done in practice to use compute effectively) to knowledge distillation (shows that for a given level of expressivity, training on a larger teacher model’s logits will achieve substantially lower loss than supervised training to convergence from random initialization, which shows that training to convergence isn’t even well-described as “minimizing loss”).
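For readers unfamiliar with the distillation setup mentioned above, here is a minimal sketch of the two training signals being compared: cross-entropy against a hard label versus cross-entropy against a teacher's softened output distribution. The temperature parameter and all function names are assumptions added for illustration. The usual recipe also mixes the two losses and rescales the soft term, which is omitted here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against target_probs."""
    student_probs = softmax(student_logits, temperature)
    return -sum(
        t * math.log(s) for t, s in zip(target_probs, student_probs) if t > 0
    )


def hard_label_loss(student_logits, label):
    """Ordinary supervised loss: the target is a one-hot label."""
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    return cross_entropy(one_hot, student_logits)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Distillation loss: the target is the teacher's softened distribution."""
    soft_targets = softmax(teacher_logits, temperature)
    return cross_entropy(soft_targets, student_logits, temperature)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]  # hypothetical teacher logits
    student = [0.5, 0.4, 0.1]  # hypothetical student logits
    print("hard-label loss:  ", hard_label_loss(student, label=0))
    print("distillation loss:", distillation_loss(student, teacher))
```

The point of the sketch is only that the two objectives give the student different targets for the same input. It does not demonstrate the empirical claim about which one reaches lower loss.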
And I’m not aware of good theoretical bounds here either; the cutting-edge PAC-Bayes results are, like, bounding MNIST test error to 2.7% on an empirical setup which got 2%. That’s a big and cool theoretical achievement, but—if my understanding of theory SOTA is correct—we definitely don’t have the theoretical precision to be confidently reasoning about loss minimization like this on that basis.
I feel like this unsupported assumption entered the groundwater somehow and now looms behind lots of alignment reasoning. I don’t know where it comes from. On the off-chance it’s actually well-founded, I’d deeply appreciate an explanation or link.
FWIW I think this claim runs afoul of what I was trying to communicate in “Reward is not the optimization target.” (I mention this since it’s relevant to a discussion we had last year, about how many people already understood what I was trying to convey.)
I also see no reason to expect this to be a good “conservative worst-case”, and this is why I’m so leery of worst-case analysis a la ELK. I see little reason that reasoning this way will be helpful in reality.
It seems to me like you’re positing some “need to do well in training”, which is… a kinda weird frame. In a weak correlational sense, it’s true that loss tends to decrease over training-time and research-time.
No, I don’t think I’m positing that—in fact, I said that the aligned model doesn’t do this.
I feel like this unsupported assumption entered the groundwater somehow and now looms behind lots of alignment reasoning. I don’t know where it comes from. On the off-chance it’s actually well-founded, I’d deeply appreciate an explanation or link.
I do think this is a fine way to reason about things. Here’s how I would justify this: We know that SGD is selecting for models based on some combination of loss and inductive biases, but we don’t know the exact tradeoff. We could just try to directly theorize about the multivariate optimization problem, but that’s quite difficult. Instead, we can take either variable as a constraint, and theorize about the univariate optimization problem subject to that constraint. We now have two dual optimization problems, “minimize loss subject to some level of inductive biases” and “maximize inductive biases subject to some level of loss” which we can independently investigate to produce evidence about the original joint optimization problem.
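The two constrained problems can be illustrated with a toy candidate set. This is only a sketch: it assumes each candidate model can be summarized by a scalar loss and a scalar "complexity" score, where lower complexity stands in for being more favored by the inductive biases. The candidates, their numbers, and the function names are all hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    loss: float
    complexity: float  # lower = more favored by inductive biases (assumption)


def min_loss_given_complexity(cands, max_complexity):
    """Minimize loss subject to a cap on complexity."""
    feasible = [c for c in cands if c.complexity <= max_complexity]
    return min(feasible, key=lambda c: c.loss) if feasible else None


def min_complexity_given_loss(cands, max_loss):
    """Minimize complexity subject to a cap on loss."""
    feasible = [c for c in cands if c.loss <= max_loss]
    return min(feasible, key=lambda c: c.complexity) if feasible else None


# Hypothetical candidates with made-up numbers.
CANDIDATES = [
    Candidate("A", loss=0.30, complexity=1.0),
    Candidate("B", loss=0.10, complexity=3.0),
    Candidate("C", loss=0.05, complexity=7.0),
]

if __name__ == "__main__":
    print(min_loss_given_complexity(CANDIDATES, max_complexity=4.0))
    print(min_complexity_given_loss(CANDIDATES, max_loss=0.2))
    print(min_complexity_given_loss(CANDIDATES, max_loss=0.05))
```

Tightening the loss cap changes which candidate wins the second problem, which is why the level at which the constraint is set matters to the argument.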
It seems to me like you’re positing some “need to do well in training”, which is… a kinda weird frame. In a weak correlational sense, it’s true that loss tends to decrease over training-time and research-time.
No, I don’t think I’m positing that—in fact, I said that the aligned model doesn’t do this.
I don’t understand why you claim to not be doing this. Probably we misunderstand each other? You do seem to be incorporating a “(strong) pressure to do well in training” in your reasoning about what gets trained. You said (emphasis added):
The argument for deceptive alignment is that deceptive alignment might be the easiest way for the model to figure out how to do well in training.
This seems to be engaging in the kind of reasoning I’m critiquing.
We now have two dual optimization problems, “minimize loss subject to some level of inductive biases” and “maximize inductive biases subject to some level of loss” which we can independently investigate to produce evidence about the original joint optimization problem.
Sure, this (at first pass) seems somewhat more reasonable, in terms of ways of thinking about the problem. But I don’t think the vast majority of “loss-minimizing” reasoning actually involves this more principled analysis. Before now, I have never heard anyone talk about this frame, or any other recovery which I find satisfying.
So this feels like a motte-and-bailey, where the strong-and-common claim goes like “we’re selecting models to minimize loss, and so if deceptive models get lower loss, that’s a huge problem; let’s figure out how to not make that be true” and the defensible-but-weak-and-rare claim is “by considering loss minimization given certain biases, we can gain evidence about what kinds of functions SGD tends to train.”
You do seem to be incorporating a “(strong) pressure to do well in training” in your reasoning about what gets trained.
I mean, certainly there is a strong pressure to do well in training—that’s the whole point of training. What there isn’t strong pressure for is for the model to internally be trying to figure out how to do well in training. The model need not be thinking about training at all to do well on the training objective, e.g. as in the aligned model.
To be clear, here are some things that I think:
The model needs to figure out how to somehow output a distribution that does well in training. Exactly how well relative to the inductive biases is unclear, but generally I think the easiest way to think about this is to take performance at the level you expect of powerful future models as a constraint.
There are many algorithms which result in outputting a distribution that does well in training. Some of those algorithms directly reason about the training process, whereas some do not.
Taking training performance as a constraint, the question is what is the easiest way (from an inductive bias perspective) to produce such a distribution.
Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don’t just get complete memorization (which is highly unlikely under the inductive biases).
Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
Comparing the deceptive to sycophantic models, the primary question is which one is an easier way (from an inductive bias perspective) to compute how to do well on the training process: directly memorizing pointers to that information in the world model, or deducing that information using the world model based on some goal.
I have never heard anyone talk about this frame
I think that’s probably just because you haven’t talked to me much about this. The point about whether to use a loss minimization + inductive bias constraint vs. loss constraint + inductive bias minimization was a big one that I commented a bunch about on Joe’s report. In fact, I suspect he’d have some more thoughts on this; I think he’s not fully sold on my framing above.
So this feels like a motte-and-bailey
I agree that there are some people that might defend different claims than I would, but I don’t think I should be responsible for those claims. Part of why I’m excited about Joe’s report is that it takes a bunch of different isolated thinking from different people and puts it into a single coherent position, so it’s easier to evaluate that position in totality. If you have disagreements with my position, with Joe’s position, or with anyone else’s position, that’s obviously totally fine—but you shouldn’t equate them into one group and say it’s a motte-and-bailey. Different people just think different things.
I mean, certainly there is a strong pressure to do well in training—that’s the whole point of training.
I disagree. This claim seems to be precisely what my original comment was critiquing:
It seems to me like you’re positing some “need to do well in training”, which is… a kinda weird frame. In a weak correlational sense, it’s true that loss tends to decrease over training-time and research-time.
But I disagree with an assumption of “in the future, the trained model will need to achieve loss scores so perfectly low that the model has to e.g. ‘try to do well’ or reason about its own training process in order to get a low enough loss, otherwise the model gets ‘selected against’.”
And then you wrote, as some things you believe:[1]
The model needs to figure out how to somehow output a distribution that does well in training...
Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don’t just get complete memorization (which is highly unlikely under the inductive biases)...
Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
This is the kind of claim I was critiquing in my original comment!
but you shouldn’t equate them into one group and say it’s a motte-and-bailey. Different people just think different things.
My model of you makes both claims, and I indeed see signs of both kinds of claims in your comments here.
I am very confused now about what you believe. Obviously training selects for low loss algorithms… that’s the whole point of training? I thought you were saying that training doesn’t select for algorithms that internally optimize for loss, which is true, but it definitely does select for algorithms that in fact get low loss.
The point of training in a practical sense is generally to produce networks with desirable behavior. The point of training in a dynamical sense is to perform an optimizer-mediated update to locally reduce loss along the locally steepest direction, aggregating gradients over different subsets of the data.
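As a concrete picture of the dynamical description, here is a minimal mini-batch SGD loop on a toy linear regression. It is a sketch under assumptions: the model, data, learning rate, and batch size are all made up, and nothing here is specific to the networks under discussion.

```python
import random


def sgd_step(w, b, batch, lr):
    """One update for y ~ w*x + b using the mean squared error on one batch."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    # Step against the gradient, i.e. along the locally steepest descent direction.
    return w - lr * grad_w, b - lr * grad_b


def mse(w, b, data):
    """Mean squared error of the linear model over a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(data, epochs=200, batch_size=4, lr=0.01, seed=0):
    """Run mini-batch SGD, reshuffling the data each epoch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        # Each update aggregates gradients over a different subset of the data.
        for i in range(0, len(data), batch_size):
            w, b = sgd_step(w, b, data[i:i + batch_size], lr)
    return w, b


# Noiseless toy data drawn from y = 2x + 1 (assumption).
DATA = [(x, 2.0 * x + 1.0) for x in range(-4, 5)]

if __name__ == "__main__":
    w, b = train(DATA)
    print("w, b:", w, b)
    print("final loss:", mse(w, b, DATA))
```

Each step only uses local gradient information from one batch. The loop contains no explicit notion of "selecting" among whole algorithms, which is part of why the phrase needs unpacking.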
What is the empirical content of the claim that “training selects for low loss algorithms”? Can you make it more precise, perhaps by tabooing “selects for”?
I’ll register a prediction here: TurnTrout is trying to say that while, counterfactually, an algorithm that reasons about training would achieve low loss if we had one, it’s not obviously true that such algorithms are actually “achievable” for SGD in some “natural” setting.
Wait, where? I think the objection to “Doing that is quite hard” is not an objection to “it’s not obviously true that such algorithms are actually “achievable” for SGD”; it’s an objection to the conclusion that the model would try hard enough to justify arguments about deception from a weak statement about loss decreasing during training.
an objection to the conclusion that the model would try hard enough to justify arguments about deception from a weak statement about loss decreasing during training.
We know that SGD is selecting for models based on some combination of loss and inductive biases, but we don’t know the exact tradeoff.
Actually, I’ve thought more, and I don’t think that this dual-optimization perspective makes it better. I deny that we “know” that SGD is selecting on that combination, in the sense which seems to be required for your arguments to go through.
It sounds to me like I said “here’s why you can’t think about ‘what gets low loss’” and then you said[1] “but what if I also think about certain inductive biases too?” and then you also said “we know that it’s OK to think about it this way.” No, I contend that we don’t know that. That was a big part of what I was critiquing.
As an alert—It feels like your response here isn’t engaging with the points I raised in my original comment. I expect I talked past you and you, accordingly, haven’t seen the thing I’m trying to point at.
The question of how strongly training pressures models to minimize loss is one that I isolate and discuss explicitly in the report, in section 1.5, “On ‘slack’ in training”—and at various points the report references how differing levels of “slack” might affect the arguments it considers. Here I was influenced in part by discussions with various people, yourself included, who seemed to disagree about how much weight to put on arguments in the vein of: “policy A would get lower loss than policy B, so we should think it more likely that SGD selects policy A than policy B.”
(And for clarity, I don’t think that arguments of this form always support expecting models to do tons of reasoning about the training set-up. For example, as the report discusses in e.g. Section 4.4, on “speed arguments,” the amount of world-modeling/instrumental-reasoning that the model does can affect the loss it gets via e.g. using up cognitive resources. So schemers—and also, reward-on-the-episode seekers—can be at some disadvantage, in this respect, relative to models that don’t think about the training process at all.)
I really don’t like that you’ve taken this discussion to Twitter. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.
I haven’t “taken this discussion to Twitter”. Joe Carlsmith posted about the paper on Twitter. I saw that post, and wrote my response on Twitter. I didn’t even know it was also posted on LW until later, and decided to repost the stuff I’d written on Twitter here. If anything, I’ve taken my part of the discussion from Twitter to LW. I’m slightly baffled and offended that you seem to be platform-policing me?
Anyways, it looks like you’re making the objection I predicted with the paragraphs:
One obvious counterpoint I expect is to claim that the “[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)]” steps actually do contribute to the later steps, maybe because they’re a short way to compress a motivational pointer to “wanting” to do well on the training objective.
I don’t think this is how NN simplicity biases work. Under the “cognitive executions impose constraints on parameter settings” perspective, you don’t actually save any complexity by supposing that the model has some motive for figuring stuff out internally, because the circuits required to implement the “figure stuff out internally” computations themselves count as additional complexity. In contrast, if you have a view of simplicity that’s closer to program description length, then you’re not counting runtime execution against program complexity, and so a program that has short length in code but long runtime can count as simple.
Reposting my response on Twitter (To clarify, the following was originally written as a Tweet in response to Joe Carlsmith’s Tweet about the paper, which I am now reposting here):
Some other Tweets I wrote as part of the discussion: Tweet 1, Tweet 2, Tweet 3, Tweets 4 / 5.
I really don’t like all this discussion happening on Twitter, and I appreciate that you took the time to move this back to LW/AF instead. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.
Regardless, some quick thoughts:
I think this is a mischaracterization of the argument. The argument for deceptive alignment is that deceptive alignment might be the easiest way for the model to figure out how to do well in training. So a more accurate comparison would be:
Deceptive model: [figure out how to do well at training] [actually do well at training]
Sycophantic model: [figure out how to do well at training] [actually do well at training]
Aligned model: [figure out how to be aligned] [actually be aligned]
Notably, at this level of description the deceptive and sycophantic models are the same! They only look different when we break apart the “figure out how to do well at training” part. We could do the same breakdown for the sycophantic model, which might look something like:
Sycophantic model: [load in some hard-coded specification of what it means to do well in training] [figure out how to execute on that specification in this environment] [actually do well at training]
The problem is that figuring out how to do well at training is actually quite hard, and deceptive alignment might make that problem easier by reducing it to the (potentially) simpler/easier problem of figuring out how to accomplish <insert any long-term goal here>. Whereas the sycophantic model just has to memorize a bunch of stuff about training that the deceptive model doesn’t have to.
The point is that you can’t just say “well, deceptive alignment results in the model trying to do well in training, so why not just learn a model that starts by trying to do well in training” for the same reason that you can’t just say “well, deceptive alignment results in the model outputting this specific distribution, so why not just learn a model that memorizes that exact distribution”. The entire question is about what the easiest way is to produce that distribution in terms of the inductive biases.
Also, another point that I’d note here: the sycophantic model isn’t actually desirable either! So long as the deceptive model beats the aligned model in terms of the inductive biases, it’s still a concern, regardless of whether it beats the sycophantic model or not. I’m pretty unsure which is more likely between the deceptive and sycophantic models, but I think both pretty likely beat the aligned model in most cases that we care about. But I’m more optimistic that we can find ways to address sycophantic models than deceptive models, such that I think the deceptive models are more of a concern.
It seems to me like you’re positing some “need to do well in training”, which is… a kinda weird frame. In a weak correlational sense, it’s true that loss tends to decrease over training-time and research-time.
But I disagree with an assumption of “in the future, the trained model will need to achieve loss scores so perfectly low that the model has to e.g. ‘try to do well’ or reason about its own training process in order to get a low enough loss, otherwise the model gets ‘selected against’.” (It seems to me that you are making this assumption; let me know if you are not.) (ETA: Evan doesn’t think that it’s necessary to do this in order to be selected by training. In particular, he wrote that ‘aligned models’ won’t need to do this.)
I don’t know why so many people seem to think model training works like this. That is, that one can:
(1) verbally reason about whether a postulated algorithm “minimizes loss” (example algorithm: using a bunch of interwoven heuristics to predict text), and then
(2) brainstorm English descriptions of algorithms which, according to us, get even lower loss (example algorithm: reason out the whole situation every forward pass, and figure out what would minimize loss on the next token), and then
(3) since algorithm-2 gets “lower loss” (as reasoned about informally in English), one is licensed to conclude that SGD is incentivized to pick the latter algorithm.
I think that loss just does not constrain training that tightly, or in that fashion.[2] I can give a range of counterevidence, from early stopping (done in practice to use compute effectively) to knowledge distillation (shows that for a given level of expressivity, training on a larger teacher model’s logits will achieve substantially lower loss than supervised training to convergence from random initialization, which shows that training to convergence isn’t even well-described as “minimizing loss”).
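To make the distillation comparison concrete, here is a minimal sketch of the two training targets being contrasted: ordinary cross-entropy against hard labels, and cross-entropy against a teacher's softened output distribution. The function names, the temperature value, and the example logits are all assumptions made for illustration. The sketch only shows how the objectives differ; it does not reproduce the empirical result about which one reaches lower loss.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax with an optional temperature (assumed convention)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hard_label_loss(student_logits, labels):
    """Supervised target: cross-entropy against one-hot class labels."""
    p = softmax(student_logits)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels])))

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Distillation target: cross-entropy against the teacher's softened distribution."""
    target = softmax(teacher_logits, temperature)
    log_p = np.log(softmax(student_logits, temperature))
    return float(-np.mean(np.sum(target * log_p, axis=-1)))

# Hypothetical logits for two examples over three classes.
teacher_logits = np.array([[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]])
labels = np.array([0, 1])
untrained_student = np.zeros_like(teacher_logits)  # uniform predictions

loss_on_labels = hard_label_loss(untrained_student, labels)
loss_on_teacher = distillation_loss(untrained_student, teacher_logits)
loss_if_matching = distillation_loss(teacher_logits, teacher_logits)
```

The teacher distribution carries information about relative class similarity that a one-hot label discards, which is one commonly offered reason the two objectives can lead the same architecture to different solutions.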
And I’m not aware of good theoretical bounds here either; the cutting-edge PAC-Bayes results are, like, bounding MNIST test error to 2.7% on an empirical setup which got 2%. That’s a big and cool theoretical achievement, but—if my understanding of theory SOTA is correct—we definitely don’t have the theoretical precision to be confidently reasoning about loss minimization like this on that basis.
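For readers unfamiliar with the kind of result being referenced, one commonly cited form of a PAC-Bayes bound (the McAllester-style bound with Maurer's constant) is sketched below. The notation is introduced here and is not from the comment: $P$ is a prior over hypotheses fixed before seeing data, $Q$ is any posterior, $n \ge 8$ is the number of i.i.d. training samples, the loss takes values in $[0,1]$, and $\delta$ is the failure probability. The specific MNIST results mentioned above may use tighter variants.

```latex
% With probability at least 1 - \delta over the training sample,
% simultaneously for all posteriors Q:
\[
  \mathbb{E}_{h \sim Q}\big[L(h)\big]
  \;\le\;
  \mathbb{E}_{h \sim Q}\big[\hat{L}(h)\big]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\!\big(2\sqrt{n}/\delta\big)}{2n}}
\]
% L is the true risk and \hat{L} is the empirical risk on the n samples.
```

The gap between the bound and the measured error is driven by the $\mathrm{KL}(Q \,\|\, P)$ term, which is why a bound of this shape can sit noticeably above the empirical test error.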
I feel like this unsupported assumption entered the groundwater somehow and now looms behind lots of alignment reasoning. I don’t know where it comes from. On the off-chance it’s actually well-founded, I’d deeply appreciate an explanation or link.
FWIW I think this claim runs afoul of what I was trying to communicate in “Reward is not the optimization target.” (I mention this since it’s relevant to a discussion we had last year, about how many people already understood what I was trying to convey.)
I also see no reason to expect this to be a good “conservative worst-case”, and this is why I’m so leery of worst-case analysis a la ELK. I see little reason that reasoning this way will be helpful in reality.
No, I don’t think I’m positing that—in fact, I said that the aligned model doesn’t do this.
I do think this is a fine way to reason about things. Here’s how I would justify this: We know that SGD is selecting for models based on some combination of loss and inductive biases, but we don’t know the exact tradeoff. We could just try to directly theorize about the multivariate optimization problem, but that’s quite difficult. Instead, we can take either variable as a constraint, and theorize about the univariate optimization problem subject to that constraint. We now have two dual optimization problems, “minimize loss subject to some level of inductive biases” and “maximize inductive biases subject to some level of loss” which we can independently investigate to produce evidence about the original joint optimization problem.
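As a rough sketch, the joint problem and the two constrained problems can be written as follows. This assumes, purely for illustration, that “how disfavored a model is by the inductive biases” can be summarized by a scalar cost $C(\theta)$ and that the unknown tradeoff is a single weight $\lambda$. Neither symbol appears in the comment above, and nothing here asserts that SGD actually solves any of these problems.

```latex
% Joint problem with an unknown tradeoff \lambda between loss and inductive-bias cost:
\[
  \theta^\star \in \arg\min_{\theta} \; L(\theta) + \lambda\, C(\theta)
\]
% "Minimize loss subject to some level of inductive biases" (budget c):
\[
  \min_{\theta} \; L(\theta) \quad \text{subject to} \quad C(\theta) \le c
\]
% "Maximize inductive biases subject to some level of loss" (level \ell),
% i.e. find the least-disfavored model that reaches the required loss:
\[
  \min_{\theta} \; C(\theta) \quad \text{subject to} \quad L(\theta) \le \ell
\]
```

Under this notation, sweeping $c$ or $\ell$ traces out candidate solutions of the joint problem for different values of $\lambda$, which is the sense in which each constrained problem can supply evidence about the joint one.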
I don’t understand why you claim to not be doing this. Probably we misunderstand each other? You do seem to be incorporating a “(strong) pressure to do well in training” in your reasoning about what gets trained. You said (emphasis added):
This seems to be engaging in the kind of reasoning I’m critiquing.
Sure, this (at first pass) seems somewhat more reasonable, in terms of ways of thinking about the problem. But I don’t think the vast majority of “loss-minimizing” reasoning actually involves this more principled analysis. Before now, I have never heard anyone talk about this frame, or any other recovery which I find satisfying.
So this feels like a motte-and-bailey, where the strong-and-common claim goes like “we’re selecting models to minimize loss, and so if deceptive models get lower loss, that’s a huge problem; let’s figure out how to not make that be true” and the defensible-but-weak-and-rare claim is “by considering loss minimization given certain biases, we can gain evidence about what kinds of functions SGD tends to train.”
I mean, certainly there is a strong pressure to do well in training—that’s the whole point of training. What there isn’t strong pressure for is for the model to internally be trying to figure out how to do well in training. The model need not be thinking about training at all to do well on the training objective, e.g. as in the aligned model.
To be clear, here are some things that I think:
The model needs to figure out how to somehow output a distribution that does well in training. Exactly how well relative to the inductive biases is unclear, but generally I think the easiest way to think about this is to take performance at the level you expect of powerful future models as a constraint.
There are many algorithms which result in outputting a distribution that does well in training. Some of those algorithms directly reason about the training process, whereas some do not.
Taking training performance as a constraint, the question is what is the easiest way (from an inductive bias perspective) to produce such a distribution.
Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don’t just get complete memorization (which is highly unlikely under the inductive biases).
Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
Comparing the deceptive to sycophantic models, the primary question is which one is an easier way (from an inductive bias perspective) to compute how to do well on the training process: directly memorizing pointers to that information in the world model, or deducing that information using the world model based on some goal.
I think probably that’s just because you haven’t talked to me much about this. The point about whether to use a loss minimization + inductive bias constraint vs. loss constraint + inductive bias minimization was a big one that I commented a bunch about on Joe’s report. In fact, I suspect he’d probably have some more thoughts here on this—I think he’s not fully sold on my framing above.
I agree that there are some people that might defend different claims than I would, but I don’t think I should be responsible for those claims. Part of why I’m excited about Joe’s report is that it takes a bunch of different isolated thinking from different people and puts it into a single coherent position, so it’s easier to evaluate that position in totality. If you have disagreements with my position, with Joe’s position, or with anyone else’s position, that’s obviously totally fine, but you shouldn’t lump them into one group and say it’s a motte-and-bailey. Different people just think different things.
I disagree. This claim seems to be precisely what my original comment was critiquing:
And then you wrote, as some things you believe:[1]
This is the kind of claim I was critiquing in my original comment!
My model of you makes both claims, and I indeed see signs of both kinds of claims in your comments here.
Thanks for writing out a bunch of things you believe, by the way! That was helpful.
I am very confused now about what you believe. Obviously training selects for low loss algorithms… that’s the whole point of training? I thought you were saying that training doesn’t select for algorithms that internally optimize for loss, which is true, but it definitely does select for algorithms that in fact get low loss.
The point of training in a practical sense is generally to produce networks with desirable behavior. The point of training in a dynamical sense is to perform an optimizer-mediated update to locally reduce loss along the locally steepest direction, aggregating gradients over different subsets of the data.
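The “dynamical sense” described above can be sketched as plain minibatch SGD on a toy problem. This is a minimal illustration under assumed choices (a linear model, mean squared error, a fixed learning rate, synthetic noiseless data); none of these details come from the comment, and real training setups use more elaborate optimizers.

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model X @ w against targets y."""
    return float(np.mean((X @ w - y) ** 2))

def sgd_step(w, X_batch, y_batch, lr):
    """One local update: step against the gradient of the loss on this batch."""
    residual = X_batch @ w - y_batch
    grad = 2.0 * X_batch.T @ residual / len(y_batch)  # gradient of the batch MSE
    return w - lr * grad

def train(X, y, lr=0.05, batch_size=8, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))  # a different split into subsets each epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            w = sgd_step(w, X[idx], y[idx], lr)
    return w

# Synthetic data with a known (hypothetical) ground-truth weight vector.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

initial_loss = mse(np.zeros(3), X, y)
w_final = train(X, y)
final_loss = mse(w_final, X, y)
```

Each step only uses local gradient information from one subset of the data. Nothing in the update rule refers to a global notion of “the lowest-loss algorithm”.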
What is the empirical content of the claim that “training selects for low loss algorithms”? Can you make it more precise, perhaps by tabooing “selects for”?
My prediction is that TurnTrout is trying to say that while, counterfactually, an algorithm that reasons about training would achieve low loss if we had one, it’s not obviously true that such algorithms are actually “achievable” for SGD in some “natural” setting.
That’s what I thought he was saying previously, but he objected to that characterization in his most recent comment.
Wait, where? I think the objection to “Doing that is quite hard” is not an objection to “it’s not obviously true that such algorithms are actually ‘achievable’ for SGD”; it’s an objection to concluding, from a weak statement about loss decreasing during training, that the model would try hard enough to justify arguments about deception.
This is… roughly one point I was making, yes.
Actually, I’ve thought more, and I don’t think that this dual-optimization perspective makes it better. I deny that we “know” that SGD is selecting on that combination, in the sense which seems to be required for your arguments to go through.
It sounds to me like I said “here’s why you can’t think about ‘what gets low loss’” and then you said[1] “but what if I also think about certain inductive biases too?” and then you also said “we know that it’s OK to think about it this way.” No, I contend that we don’t know that. That was a big part of what I was critiquing.
As an alert—It feels like your response here isn’t engaging with the points I raised in my original comment. I expect I talked past you and you, accordingly, haven’t seen the thing I’m trying to point at.
[1] This isn’t a quote; this is just how your comment parsed to me.
The question of how strongly training pressures models to minimize loss is one that I isolate and discuss explicitly in the report, in section 1.5, “On ‘slack’ in training”—and at various points the report references how differing levels of “slack” might affect the arguments it considers. Here I was influenced in part by discussions with various people, yourself included, who seemed to disagree about how much weight to put on arguments in the vein of: “policy A would get lower loss than policy B, so we should think it more likely that SGD selects policy A than policy B.”
(And for clarity, I don’t think that arguments of this form always support expecting models to do tons of reasoning about the training set-up. For example, as the report discusses in e.g. Section 4.4, on “speed arguments,” the amount of world-modeling/instrumental-reasoning that the model does can affect the loss it gets via e.g. using up cognitive resources. So schemers—and also, reward-on-the-episode seekers—can be at some disadvantage, in this respect, relative to models that don’t think about the training process at all.)
I haven’t “taken this discussion to Twitter”. Joe Carlsmith posted about the paper on Twitter. I saw that post, and wrote my response on Twitter. I didn’t even know it was also posted on LW until later, and decided to repost the stuff I’d written on Twitter here. If anything, I’ve taken my part of the discussion from Twitter to LW. I’m slightly baffled and offended that you seem to be platform-policing me?
Anyways, it looks like you’re making the objection I predicted with the paragraphs:
In particular, when I said “maybe because they’re a short way to compress a motivational pointer to “wanting” to do well on the training objective.” I think this is pointing at the same thing you reference when you say “The entire question is about what the easiest way is to produce that distribution in terms of the inductive biases.”
I.e., given the actual simplicity bias of models, what is the shortest (or most compressed) way of specifying “a model that starts by trying to do well in training”? And my response is that I think the model pays a complexity penalty for runtime computations (since they translate into constraints on parameter values which are needed to implement those computations). Even if those computations are motivated by something we call a “goal”, they still need to be implemented in the circuitry of the model, and thus also constrain its parameters.
Also, when I reference models whose internal cognition looks like “[figure out how to do well at training] [actually do well at training]”, I don’t have sycophantic models in particular in mind. It also includes aligned models, since those models do implement the “[figure out how to do well at training] [actually do well at training]” steps (assuming that aligned behavior does well in training).
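The claim that extra computations act as constraints on parameters can be illustrated with a toy count. The sketch below assumes a drastically simplified “network” whose parameters are a short bit-vector, where implementing a circuit means fixing some bits to a specific pattern. The function name, bit counts, and number of variants are all hypothetical. Real networks have continuous parameters and many weight settings per function, so this illustrates the shape of the argument and is not evidence for it.

```python
from itertools import product

def count_configs(n_bits, task_bits, extra_bits=0, extra_variants=1):
    """Brute-force count of bit-vectors that implement the task circuit and,
    optionally, any one of `extra_variants` additional circuits."""
    task_pattern = (1,) * task_bits
    # Each variant of the extra computation is one specific pattern on the first bits.
    variants = set(list(product((0, 1), repeat=extra_bits))[:extra_variants])
    count = 0
    for bits in product((0, 1), repeat=n_bits):
        does_task = bits[n_bits - task_bits:] == task_pattern
        does_extra = extra_bits == 0 or bits[:extra_bits] in variants
        if does_task and does_extra:
            count += 1
    return count

# Networks that only do the task: the remaining 8 bits are unconstrained.
direct = count_configs(n_bits=12, task_bits=4)

# Networks that also run one of 5 possible extra computations using 6 bits.
with_extra = count_configs(n_bits=12, task_bits=4, extra_bits=6, extra_variants=5)
```

In this toy setting the extra computation multiplies the count by the number of variants (5) but divides it by the number of patterns on the bits it occupies (2**6 = 64), so the configurations that do “something extra” are rarer than the ones that do not.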
Good point. I think I’m misdirecting my annoyance here; I really dislike that there’s so much alignment discussion moving from LW to Twitter, but I shouldn’t have implied that you were responsible for that—and in fact I appreciate that you took the time to move this discussion back here. Sorry about that—I edited my comment.
Yes, I think we agree there. But that doesn’t imply that just because deceptive alignment is a way of calculating what the training process wants you to do, that you can then just memorize the result of that computation in the weights and thereby simplify the model—for the same reason SGD doesn’t memorize the entire distribution in the weights either.
(Partly re-hashing my response from twitter.)
I’m seeing your main argument here as a version of what I call, in section 4.4, a “speed argument against schemers”—e.g., basically, that SGD will punish the extra reasoning that schemers need to perform.
(I’m generally happy to talk about this reasoning as a complexity penalty, and/or about the params it requires, and/or about circuit-depth—what matters is the overall “preference” that SGD ends up with. And thinking of this consideration as a different kind of counting argument *against* schemers seems like it might well be a productive frame. I do also think that questions about whether models will be bottlenecked on serial computation, and/or whether “shallower” computations will be selected for, are pretty relevant here, and the report includes a rough calculation in this respect in section 4.4.2 (see also summary here).)
Indeed, I think that maybe the strongest single argument against scheming is a combination of
(1) “Because of the extra reasoning schemers perform, SGD would prefer non-schemers over schemers in a comparison re: final properties of the models” and
(2) “The type of path-dependence/slack at stake in training is such that SGD will get the model that it prefers overall.”
My sense is that I’m less confident than you in both (1) and (2), but I think they’re both plausible (the report, in particular, argues in favor of (1)), and that the combination is a key source of hope. I’m excited to see further work fleshing out the case for both (including e.g. the sorts of arguments for (2) that I took you and Nora to be gesturing at on twitter—the report doesn’t spend a ton of time on assessing how much path-dependence to expect, and of what kind).
Re: your discussion of the “ghost of instrumental reasoning,” “deducing lots of world knowledge ‘in-context,’” and “the perspective that NNs will ‘accidentally’ acquire such capabilities internally as a convergent result of their inductive biases”: especially given that you only skimmed the report’s section headings and a small amount of the content, I have some sense, here, that you’re responding to other arguments you’ve seen about deceptive alignment, rather than to specific claims made in the report (I don’t, for example, make any claims about world knowledge being derived “in-context,” or about models “accidentally” acquiring flexible instrumental reasoning). Is your basic thought something like: sure, the models will develop flexible instrumental reasoning that could in principle be used in service of arbitrary goals, but they will only in fact use it in service of the specified goal, because that’s the thing training pressures them to do? If so, my feeling is something like: ok, but a lot of the question here is whether using the instrumental reasoning in service of some other goal (one that backchains into getting-reward) will be suitably compatible with/incentivized by training pressures as well. And I don’t see e.g. the reversal curse as strong evidence on this front.
Re: “mechanistically ungrounded intuitions about ‘goals’ and ‘tryingness’”—as I discuss in section 0.1, the report is explicitly setting aside disputes about whether the relevant models will be well-understood as goal-directed (my own take on that is in section 2.2.1 of my report on power-seeking AI here). The question in this report is whether, conditional on goal-directedness, we should expect scheming. That said, I do think that what I call the “messyness” of the relevant goal-directedness might be relevant to our overall assessment of the arguments for scheming in various ways, and that scheming might require an unusually high standard of goal-directedness in some sense. I discuss this in section 2.2.3, on “‘Clean’ vs. ‘messy’ goal-directedness,” and in various other places in the report.
Re: “long term goals are sufficiently hard to form deliberately that I don’t think they’ll form accidentally”—the report explicitly discusses cases where we intentionally train models to have long-term goals (both via long episodes, and via short episodes aimed at inducing long-horizon optimization). I think scheming is more likely in those cases. See section 2.2.4, “What if you intentionally train the model to have long-term goals?” That said, I’d be interested to see arguments that credit assignment difficulties actively count against the development of beyond-episode goals (whether in models trained on short episodes or long episodes) for models that are otherwise goal-directed. And I do think that, if we could be confident that models trained on short episodes won’t learn beyond-episode goals accidentally (even irrespective of mundane adversarial training—e.g., that models rewarded for getting gold coins on the episode would not learn a goal that generalizes to caring about gold coins in general, even prior to efforts to punish it for sacrificing gold-coins-on-the-episode for gold-coins-later), that would be a significant source of comfort (I discuss some possible experimental directions in this respect in section 6.2).