A Dialogue on Deceptive Alignment Risks

I have long been puzzled by the wide differences in the amount of concern established alignment researchers show for deceptive alignment risks. I haven’t seen a post that clearly outlines the cruxes: much of the relevant conversation is buried deep within LessWrong comment sections and shortform threads or isn’t on the forum at all. Joe Carlsmith’s report on scheming gives an excellent overview of the arguments but is 127 pages long. This post is my attempt to give a relatively short overview of the most important arguments on deceptive alignment through a dialogue between Skeptic and Advocate, fictional characters on the two sides of the debate.[1]

Definitions and scope

Skeptic: “First, let’s settle on a shared definition. When I speak of deceptive alignment, I mean the specific failure mode described in Risks from Learned Optimization: an ML model inside a training process reasons that it should emit particular outputs to avoid the training process modifying it in an undesired way. I am NOT talking about deception in general, where an AI interacting with humans for any reason knowingly provides them with false or misleading information. In other words, Deceptive AI ≠ Deceptively-aligned AI.

Second, let’s define the bounds of the argument. When I say I assign a probability of <1% to deceptive alignment occurring, I mean a probability of <1% before human obsolescence. I’m not claiming that our future AI successors won’t be worrying about deception when they’re building literal galaxy brains that are nothing like today’s LLMs. Do we agree on both of these points?”

Advocate: “Yep, sounds good.”

Skeptic: “Great. Given this definition and scope, I claim that the likelihood of deceptive alignment occurring is somewhere between 0.1% and 1%. What’s your estimate?”

Advocate: “I would put it at 40%.”

How might deceptive alignment arise?

Skeptic: “This is a bold claim, given the lack of empirical evidence we have seen for deceptive alignment. Where does this number come from?”

Advocate: “Well, let’s start by stating the obvious. When subject to selection pressure, humans often misrepresent their true motivations to get ahead. The easiest way to find deceptively aligned humans is to look at job seekers or politicians. This is a very basic fact about human cognition and already sufficient to consider the hypothesis as a serious possibility. Now, I also claim that the hypothesis is fairly likely to be true, and there are at least three plausible ways in which I can imagine deceptive alignment naturally arising in current LLMs.”

Skeptic: “I don’t think the analogy with humans is sufficient to make strong claims about AI models. For the analogy to carry weight, there would have to be shared causal mechanisms that cause similar kinds of cognition to arise. When we look at the possible causal mechanisms by which prosaic models could naturally develop deception, there seem to be only two alternatives, both very difficult:

  1. The model would have to develop circuits that robustly re-derive, in each forward pass/CoT instance, that the model’s goals imply a need to be deceptive, or

  2. The model would have to plot to gradient-hack deceptive circuits into itself during some pivotal forward passes/CoT instances.

For the first one, the model would have to first figure out the strategic situation it’s in, then think about the misalignment between its goals and the goals the training process is trying to steer it towards, and then calculate the outputs that an aligned model would have produced. All of this would have to happen within a single forward pass, over and over again. This seems like a pretty significant hit on performance, and also like it requires more sequential steps than current models can perform within a single forward pass, especially given that they appear to be ensembles of shallower networks. If the model performed this reasoning using CoT, it would have to be exceptionally good at steganography, otherwise we would easily notice its deceptive tendencies. For the second one… man, gradient hacking just seems incredibly hard.

Maybe the three plausible ways you have in mind address these difficulties—what concrete scenarios are you imagining?”

Advocate: “I’ll call the three scenarios the long-horizon RL scenario, the malign pre-training scenario, and the predicting other AIs scenario. Let’s go through all of them in detail:

  1. Long-horizon RL: The scenario I’m most strongly concerned about involves the emergence of deceptive alignment after pre-training. I believe the RLHF conditioning hypothesis: the claim that current fine-tuning methods are well-modeled as producing pre-trained models conditioned on specific prompts, rather than e.g. as producing agents. I expect this to change in the future: there’s a cap on how much you can achieve with prompting, while the economic demand for agentic and autonomous models is almost limitless. We will start doing more long-horizon RL training, which will ingrain coherent policies into models. There is abundant literature showing that this usually results in models that learn simple proxies of the true reward. We also see this in our current short-horizon fine-tuning setups: models learn to be sycophantic as a proxy for being helpful. Most such proxy rewards involve deceptive instrumental alignment as the optimal policy. Once the model is in the post-training stage, it is likely to have already developed the circuits necessary to perform the sort of reasoning that you claimed to be required for deceptive alignment, so the use of those circuits won’t have much of an effect on performance. Thus, I assign a probability of 20-25% to this scenario.

  2. Malign pre-training: This scenario is based on the hypothesis that pre-training on next-token prediction actually leads to agentic behavior in the limit, and that current models are incoherent only because they are not close to optimal predictors. It seems implausible based on our observations so far, but agency is a very effective form of compression, so I wouldn’t rule this out completely. A probability of 5% on this scenario seems fair.

  3. Predicting other AIs: A more likely story about how deceptive alignment can arise in pre-trained models is the one described in the Conditioning Predictive Models sequence. In this scenario, future pre-trained models will still be well-described as predictors, in the sense of the word defined in the Simulators post. Given that the training data of those future predictors will contain increasing amounts of AI-generated text, it seems likely that future predictive models will predict that some of the prompts they are conditioned on were generated by other AI models.

    An analogy with chess might be helpful here. Since chess computers are superhuman, a predictive model that is very good at modeling chess games, when conditioned on part of a game between two players with Elo ratings above 3000, would benefit from modeling it as a game between two AI systems rather than humans and predicting the next moves accordingly. Similarly, as AI systems get better, it becomes increasingly useful to model especially strong prompt text as having been written by a strong AI model rather than by a human. To continue such sequences, future models will be predicting the outputs of other AI models, which creates a self-fulfilling prophecy: if the model expects other highly capable AI systems to be aligned, it will generate aligned completions, and vice versa. If your model is predicting what a deceptively aligned model would do, then even if it’s not deceptively aligned itself, it will produce deceptively aligned outputs. I put a probability of 15-20% on this scenario.

In summary, though I agree with you that gradient hacking is difficult, I think that the long-horizon RL and predicting other AIs scenarios both get around the first difficulty you mentioned. Thus, deceptive alignment seems like a substantial risk even if your position is that these are the only two ways to get around those difficulties.”

Skeptic: “What we observe today doesn’t provide strong evidence for any of these scenarios. An alternative prediction that seems much more likely is that LLMs will keep doing what they’re told to do, probably even better than today as fine-tuning for instruction-following improves. They may be goal-directed in specific contexts when prompted to pursue goals, but won’t suddenly become goal-directed on their own. We will get tool AIs that just do what you say. There is still a chance of an AI being scaffolded into pursuing consistent goals, but that’s a very different risk from the one current work on deceptive alignment is trying to address.

To touch upon all three scenarios individually:

  1. Long-horizon RL: I agree that deceptive alignment is significantly more likely in worlds where models are deliberately trained to be goal-directed through long-horizon RL or self-play. However, I think it’s highly disputable whether the model would develop sufficient coherence for deceptive alignment inside such training procedures before reaching an AGI-like level, and whether it would develop all of the foundational properties for deceptive alignment in an order favorable to deception. In particular, knowing that today’s RLHF-trained models grasp human values pretty well, we can assume that when a model is put into a long-horizon RL training environment, it will already have an understanding of the base goal at the start of training, which disincentivizes the formation of proxy goals. Even if deceptive proxy goals did arise, they would have to emerge as a step change to persist: I expect us to run deception evals at multiple checkpoints during post-training, and those would catch models that are only starting to develop deceptive tendencies.

    More importantly, I think it’s quite unlikely we will start training models on outcome-based RL over extremely long time horizons before human obsolescence. We have good reasons to expect that models trained through imitation learning will become capable of automating ML R&D before acquiring the ability to perform non-trivial consequentialist reasoning in a forward pass, while we haven’t seen any evidence of successful long-horizon RL protocols applied to pre-trained LLMs.[2] It seems reasonable to expect that scaffolded agents will become much better as pre-trained models improve, and will remain the dominant architecture. Though the scaffolding system as a whole can be viewed as pursuing consistent goals, the risks from such systems are very different from the ones commonly discussed in the alignment community, since the agency comes from the system of subroutine calls to the LLM rather than from the LLM itself. As long as we stick to post-training techniques that are close to current RLHF protocols, I expect that models will keep exhibiting deceptive behavior on some particular inputs, but there’s no reason to expect that this will generalize to systemic deceptive alignment.

  2. Malign pre-training: Again, this is something that we don’t currently have empirical evidence for. My subjective probability for this scenario is even lower than yours—if it were at all likely, we would already see hints of agentic cognition in current language models.

  3. Predicting other AIs: This argument is stacking a bunch of questionable assumptions on top of each other. What empirical evidence do we have that would lead us to expect that future AIs will be modeling other AIs when solving hard problems? What empirical evidence do we have that they will model other AIs as being deceptively aligned, a very narrow and specific region within the space of possible AIs? Even assuming that those assumptions hold, what empirical evidence do we have that models are doing pure prediction? Is there any evidence that human preferences are only predicted by GPTs and not preferred? As Alex Turner puts it: “I am not aware of any basis for confidently claiming that LLMs like GPT-4 are “only predicting what comes next”, as opposed to “choosing” or “executing” one completion, or “wanting” to complete the tasks they are given, or—more generally—“making decisions on the basis of the available context”, such that our ability to behaviorally steer LLMs (e.g. reducing sycophancy) is real evidence about our control over LLM motivations.””

Advocate: “I agree with some of these critiques, but these just bring my credence in scheming down from 60% to 40%, rather than to <1%. It seems to me like you’re excessively anchoring your beliefs on the cognition of current models, even though we have strong reasons to expect some changes in the cognition of the models before we get to literal AGI: see, e.g., Gwern’s claim that Agent AIs will outcompete Tool AIs and Thane Ruthenis’s claim that LLMs’ algorithms do not necessarily generalize to how an AGI’s cognition will function. I agree there’s some chance that scaffolded LLM agents will already be useful enough to lead to human obsolescence, but that’s certainly not the scenario that occupies 99% of the outcome space. We have strong reasons to expect that deceptive cognition will be more natural for more AGI-like models, and we should also expect that once a model has developed deceptive cognition, it will be hard to get rid of it: as the sleeper agents paper showed, models that use deceptive reasoning can be a stable equilibrium of alignment training.”

What does the sleeper agents paper say about the likelihood of deceptive alignment?

Skeptic: “I don’t think the sleeper agents paper shows much about deceptive alignment. I like the way Alex Turner formulated the main message of the paper without using the phrase “deceptive alignment”: “We train an AI directly to do X only in context Y. It does X in context Y. Standard techniques are not able to undo this without knowing the context Y. Furthermore, the AI seems to reason consequentially about how to do X after we directly trained it to do so.” Yes, the paper has a bunch of well-run experiments and interesting results, but the main takeaway for me is that adversarial training has narrower effects than I thought, not that deceptive alignment is real or hard to uproot. Uprooting a backdoored behavior is different from pushing back against misgeneralization. As the authors themselves acknowledge, the paper doesn’t present a real case of deceptive alignment where such cognition arises naturally, but rather a hard-coded demo specifically designed to elicit deceptive behavior.”

Advocate: “The sleeper agents work indeed doesn’t make any claims about deceptive alignment being likely to emerge naturally, but it does provide a strong existence proof that deceptive reasoning, once present, can be hard to uproot. I also think that the results are uniquely relevant to deceptive alignment—for example, the paper shows that:

  • The deceptive reasoning trained into the models generalizes to OOD contexts (see section 7.1 of the sleeper agents paper). Furthermore, the reasoning they exhibit is consistent with deceptive alignment on persona evaluations.

  • The deceptive reasoning that the models produce has a causal effect on their decisions: if the reasoning is modified, the policy changes.

  • As the independent variables that make the studied models more analogous to realistic examples of deceptive alignment—deceptive alignment reasoning and model size—are increased, robustness to safety training increases.”

Skeptic: “I agree that the paper has taken steps to increase relevance to deceptive alignment. I still disagree that the results provide unique evidence that should significantly shake my beliefs, though. The local inductive biases are probably far different for “parameterization implementing directly-trained deceptive reasoning” vs “parameterization outputting deceptive reasoning as an instrumental backchain from consequentialist reasoning.” Thus, the evidence that the paper offers for what happens in the latter case is tiny.”

Advocate: “I strongly disagree. The model was clearly using deceptive reasoning to compute its answers, which is exactly what it’s expected to do in the scenario where deception arises naturally, and safety training not only failed to train it away but also amplified the deceptive reasoning in the case of adversarial training. I guess we’ll just disagree on this until Anthropic’s Alignment Stress Testing Team finishes their experiments on model organisms for naturally arising deception, though, so let’s move on to my next point. Namely, even if I didn’t believe any of the three threat models I presented above and didn’t buy the relevance of sleeper agents, I would still assign deceptive alignment a probability higher than 1% based on the theoretical arguments from SGD inductive biases.”

Counting and simplicity arguments

Skeptic: “Can you walk me through the argument?”

Advocate: “My quick summary of the central argument is that neural networks are highly biased towards simplicity, and deceptive alignment is a very simple way to solve a task, as it allows a model with almost any simple proxy goal to reach optimal training performance by pursuing the base objective as an instrumental subgoal.

To make a more detailed argument, let’s first distinguish between three model classes:

  • internally aligned models, which optimize for the base objective because it’s equivalent to their terminal objective,

  • corrigibly aligned models, which optimize for the base objective because their terminal objective is a robust pointer to a representation of the base objective, and

  • deceptively aligned models, which optimize for the base objective for instrumental reasons.[3]

Now, the argument has three layers:

  • the counting argument: deceptively aligned models are more abundant than internally and corrigibly aligned ones,

  • the simplicity argument: the simplest algorithm that implements a deceptively aligned model is substantially simpler than the simplest algorithm that implements an internally aligned model, and

  • the SGD inductive bias argument: the simplicity prior is a good first approximation for how SGD works.

The counting argument goes as follows. Intuitively, internally aligned models are highly specific. Only a small number of possible objectives are equal to the base objective. Corrigibly aligned models are also very specific. Out of all possible pointers, there are only very few pointers that robustly point at a model of the base objective. Deceptively aligned models, in contrast, are numerous. There are lots of objectives that involve optimizing for the base objective during training as an instrumental subgoal.

I would also argue that some of the deceptive goals are significantly simpler than the simplest aligned objective. The simplest deceptive objective is whatever the simplest long-term objective is: the only necessary precondition for deceptive alignment is an objective that extends beyond a single parameter update. Everything else can be specified later, during deployment. As I explained above, I believe that the bulk of the risk from deceptive alignment comes from scenarios where we make models more agentic through RL post-training. I don’t think non-agentic next-token predictors will be economically competitive at the time we reach AGI. This RL training will most likely bake a long-term objective into the model, which we want to be aligned with our values. Our values are at least somewhat complex, so I don’t expect any of the aligned objectives to be literally the simplest possible long-term objective. Internalizing one of the simplest possible long-term objectives is all a model needs to become deceptively aligned.

Finally, the simplicity prior seems like a fairly good first approximation of the prior that guides SGD. I certainly don’t think it perfectly describes the inductive biases of modern neural nets, but given work by Mingard et al. showing that NNs are biased towards simpler functions, as well as evidence from double descent that larger models have a stronger simplicity bias, it seems like a reasonable one. A more realistic prior might be e.g. a combination of the simplicity and speed priors, but this doesn’t seem to change the bottom line. There has been plenty of unsuccessful work trying to find priors that would change the bottom line: for example, see here for a post exploring the circuit prior.[4]

Skeptic: “These sorts of counting arguments don’t provide any evidence for deceptive alignment. We can make a structurally equivalent counting argument that is clearly false:

  1. Neural networks must implement fairly specific functions in order to generalize beyond their training data.

  2. By contrast, networks that overfit to the training set are free to do almost anything on unseen data points.

  3. In this sense, there are “more” models that overfit than models that generalize.

  4. So, other things equal, we should expect SGD to select a model that overfits.

The mistake lies in a faulty application of the principle of indifference, assuming that we should assign equal probability to each possible outcome of a random process, while in reality we should reason about the causal mechanisms that can make some outcomes much more likely than others.”

Advocate: “I’ll borrow an argument made by Evan Hubinger to show why this analogy doesn’t work. Consider two programs specified as prefixes of an infinite bitstring fed to a Turing machine. One of them takes n bits to specify and then stops, and the other takes 2n bits to specify the necessary logic plus m additional bits to specify how it behaves on unseen inputs. By your style of argument, the second program would appear to be favored, because there are many possible ways to fill in those m bits, while the n bit program can only be implemented in one way. Once you count over the whole infinite bitstring, though, you’ll see that the n bit program actually leaves n + m bits unspecified in comparison to the fully specified 2n + m bit program. You can fill these bits up with whatever you want, e.g. meaningless comments, and you’ll get a different implementation of the n bit program that is more complex.
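To make the measure comparison explicit (a rough sketch, assuming the uniform coin-flip measure over infinite bitstrings that this argument implicitly uses): the set of bitstrings that begin with the fixed n bit program has measure

$$2^{-n},$$

while each fully specified 2n + m bit program has measure $2^{-(2n+m)}$, and even summing over all $2^m$ ways of filling in its last m bits gives only

$$2^m \cdot 2^{-(2n+m)} = 2^{-2n} < 2^{-n}.$$

Counted over completions of the bitstring rather than over distinct behaviors, the shorter program dominates.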

Now, let’s directly discuss why my counting argument doesn’t suffer from the analogy you mentioned.

  • Suppose my model has an n bit world model, an m bit search procedure, and an x bit objective function.

  • Assume that the deceptive and non-deceptive models require the same n + m bits of world modeling and search, so the only difference in the model selection process is the objective function.

  • Suppose that the complexity of the simplest deceptive objective that achieves sufficiently low loss in training is l + b, where l is the number of bits needed to select the part of the objective space which includes only long-term objectives and b is the additional number of bits required to select the simplest long-run objective. Why split it up into two parts? Because the only thing required for deception is having a long-term objective, everything else is unspecified.

  • Suppose also that the complexity of the simplest aligned objective that achieves sufficiently low loss is a, and a < l + b, but l < a. Why assume l < a? See my simplicity argument above.

  • We can see now that the set of aligned objectives includes those with anything after the first a bits, while the set of deceptive objectives includes anything after the first l bits. Even though you don’t get a full program until you’re l + b bits deep, the complexity here is just l, because the bits after l don’t have to be specified in order to get deception.

  • Thus, assuming l < a, deception wins; the toy calculation right after this list makes the comparison concrete. You can contest this assumption, but the argument is sound in any case.[5]
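Here’s a toy Monte Carlo version of the counting argument, purely for illustration: it assumes a uniform prior over a fixed-size “objective slot” of bitstrings as a crude stand-in for SGD’s inductive bias, and the constants (TOTAL_BITS, A_BITS, L_BITS) are made up rather than derived from any real model.

```python
import random

# Toy counting argument over a fixed-size "objective slot".
# Assumption (for illustration only): candidate objectives are uniformly random
# fixed-length bitstrings. Which specific prefix is "required" doesn't matter;
# only the prefix lengths a and l do.
TOTAL_BITS = 20  # size of the objective slot
A_BITS = 12      # a: bits needed to pin down the simplest aligned objective
L_BITS = 6       # l: bits needed to pin down "has some long-term objective"

def matches_prefix(objective: int, prefix_bits: int) -> bool:
    """True iff the first `prefix_bits` bits of `objective` are all zero."""
    return objective >> (TOTAL_BITS - prefix_bits) == 0

n_samples = 1_000_000
aligned = deceptive = 0
for _ in range(n_samples):
    obj = random.getrandbits(TOTAL_BITS)
    aligned += matches_prefix(obj, A_BITS)
    deceptive += matches_prefix(obj, L_BITS)

print(f"aligned fraction   ~ {aligned / n_samples:.5f} (expected {2**-A_BITS:.5f})")
print(f"deceptive fraction ~ {deceptive / n_samples:.5f} (expected {2**-L_BITS:.5f})")
# With l < a, the deceptive set is 2**(A_BITS - L_BITS) = 64 times larger here.
```

The ratio between the two sets is 2^(a − l); that ratio, not the absolute counts, is what the argument turns on.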

To sum up the long reply: The fact that we don’t see overfitting is the entire reason why deceptive alignment is likely. The fact that models tend to learn simple patterns that fit the data rather than memorize a bunch of stuff is exactly why deception, a simple strategy that compresses a lot of data, might be a very likely thing for them to learn.”

Skeptic: “I concede that my analogy doesn’t directly apply to your counting argument. I no longer think your counting argument is invalid. However, I still don’t think we have any evidence that this reasoning over Turing machines would transfer over to neural networks. When reasoning about real neural nets, my impression is that the advocates for deceptive alignment risks are making claims about functions rather than parameterizations, given that popular presentations often go like “there is only one saint, many sycophants, and even more schemers.” Saying that there’s only one internally aligned model doesn’t make any sense if you’re talking about parameterizations. You can have, using your terminology, an “infinite number of saints” thanks to invariances in the weight space. If you ran your counting argument over parameterizations of functions in real neural nets, I’d be all ears, but I don’t think we understand parameterizations well enough to have any confidence in counting arguments over them.”

Advocate: “I do, in fact, agree that the correct way to run these counting arguments is to run them in the parameter space. I also agree that some of the public arguments are sloppy and would benefit from a clearer presentation; I certainly don’t think saying “there is only one saint” is technically accurate, but understand why the simplification has been made at times.

A comment by Ryan Greenblatt gives a good overview of what I actually believe. Looking at the initialization space of a neural network, you can ask: how many bits are required to specify a subset that defines an aligned objective, relative to other possible subsets that achieve a low loss on the training objective? The best way we can currently answer this question is to take the formalism I presented above, use whatever prior you think most realistically reflects deep learning while being possible to reason about theoretically (this is the important step which ensures that the argument is not just about the Solomonoff prior!), modify it to search only over fixed-size algorithms, and then run the counting argument using this modified version of my formalism.”

Skeptic: “This is much more reasonable than running counting arguments over functions. I’m glad that our beliefs are closer than I first thought. I still believe your formalism is unable to describe the full selection power being applied by SGD in the actual NN parameter space to the set of things we’re counting over, since I don’t trust your claim that it’s simple to choose a prior that realistically reflects how SGD works on neural nets: see, e.g., this paper explaining why the simplicity bias can be extremely hard to reason about in the NN prior. Additionally, there are several factors much more specific than simplicity and speed that should be taken into account, such as the fact that modern neural nets are ensembles of shallower networks biased towards low-frequency functions. These factors seem to reduce the likelihood of scheming.

The counting argument also still doesn’t seem that relevant to me, as I don’t buy the assumption that there’s going to be an inner objective distinct from inner capabilities. I agree with you that we could get persistently-activated consequentialist reasoning if we trained models directly to achieve long-range goals in a consequentialist way, but this argument doesn’t require all the complicated reasoning about sampling infinite bitstrings on a universal Turing machine using the simplicity prior.

As the last point, I want to quickly touch on your simplicity argument. Since we both agree that deceptive alignment is most likely to emerge in post-training, I don’t think it makes sense to compare the complexity of the simplest deceptive goals to the complexity of the simplest aligned goals. Current pre-trained models already have an excellent representation of human values: it’s incredibly easy to get them to understand the spirit of my instructions. By your long-horizon RL story, the models will already know the relevant concepts at the start of the long-term objective formation process; they’ll just need to build an association between those concepts and reward. This makes a goal that takes the form of pointer(s) seem much more likely to me than either the formation of an internally aligned or a deceptively aligned objective from scratch. Assuming this is the case, it seems to me that given a very large model, the simplicity differences between aligned and unaligned objectives may be negligibly small.”[6]

Advocate: “Regarding the point on a clean separation between the inner objective and inner capabilities, I’m not sure how much that matters: humans also don’t seem to have this sort of clean separation, but are clearly sufficiently goal-directed to plot very complicated deceptive schemes. The important property for deceptive alignment is coherent, beyond-episode pursuit of a single goal. The assumption of a clean separation just makes it easier to reason about the counting and simplicity arguments.

I have answers to your other points as well, but seeing that the length of our conversation is about to eclipse 5,000 words, I’ll leave those for another time. I agree with your criticism on the relevance of the counting and simplicity arguments to some extent—I think the field has progressed far enough that it’s possible to study deceptive alignment empirically, and this should be the main focus of future research.”

Skeptic: “I fully agree that further empirical work is the way to go. This is a good message to finish with.”


Ok, cool; what’s your own take?

On the question of how deceptive alignment might arise, I side with the Advocate. I haven’t seen a convincing argument that models capable of leading to human obsolescence can be reached without training procedures that include long-horizon RL on agentic tasks, so I don’t see why that should be my default expectation. Once one concedes that there’s a considerable chance the first truly transformative AGI model will be trained with long-horizon RL, it’s hard to see how the probability of deceptive alignment can stay below 1%, since the Skeptic agrees that this scenario merits substantially higher concern. I wouldn’t personally put a high probability on the other two paths to deceptive alignment that the Advocate presents, but the threat from the long-horizon RL scenario seems significant enough on its own to treat deceptive alignment as a very serious possibility.

For sleeper agents, I’m inclined to agree with the Skeptic that uprooting goal misgeneralization is a somewhat different problem from uprooting backdoored behavior. However, I also buy that the reasoning the model outputs in the sleeper agents paper is actual deceptive reasoning that’s behaviorally very close to what a truly misgeneralized model would output. Based on this, it seems reasonable to conclude that current safety protocols don’t apply particularly strong selection pressure to uproot this sort of reasoning, and we don’t seem to have good reasons to expect that uprooting goal misgeneralization requires less selection pressure than uprooting backdoors. Furthermore, the paper provides convincing evidence that deceptive reasoning becomes an increasingly small overhead as the model gets larger and uses more inference-time compute. My takeaway is that the sleeper agents work presents a pretty compelling case that models which use deceptively aligned reasoning can be a stable equilibrium of alignment training, but definitely doesn’t yet provide a model organism that exhibits real deceptive alignment in the wild.

On the topic of theoretical arguments, I tend to agree with the Advocate that SGD is already significantly biased towards simpler functions, and I can’t see a reason why this would cease to be the case when more capable models are trained. On the other hand, more capable models will have to be able to implement all sorts of complex behaviors with a huge number of serial steps, so it’s unclear whether SGD would still pick models that are e.g. well-described as ensembles of shallower networks at AGI-like level. Thus, I lean on the side of the Advocate in believing that we can reasonably choose a high-level prior like the simplicity prior and run counting arguments using it.

I am sympathetic to both the Skeptic’s argument that the simplicity differences between aligned and deceptive goals might be fairly negligible and the argument that goal slots might not be as natural as the Advocate makes them seem. However, these arguments seem insufficient to fully reject the Advocate’s arguments. Overall, though, it seems impossible for now to fully decouple the theoretical arguments from debatable assumptions, so I agree with both sides that further empirical research should be the priority for future work on deceptive alignment.

  1. ^

    Some usual disclaimers: I don’t claim to have invented any of the arguments here myself. The Skeptic was inspired by my understanding of Alex Turner’s views and the Advocate by my model of Evan Hubinger’s views, but the Skeptic doesn’t represent Turner and the Advocate doesn’t represent Hubinger, of course. I have tried to link to the original sources of the arguments wherever possible.

  2. ^

    Maybe such RL training was used for OpenAI’s o1, but there’s no strong evidence about that for now.

  3. ^

    This decomposition leaves out myopic models, but that seems like a reasonable simplification, both because they seem uncompetitive at an AGI-like level and because it makes the argument simpler.

  4. ^

    This argument is mostly focused on low path dependence worlds—worlds where SGD converges to basically the same model in every training run given that you train for long enough. For an overview of a similar argument focused on high path dependence worlds—worlds where different training runs result in models with very different properties—see this post by Mark Xu.

  5. ^

    Advocate: “Note that there are subtleties that this framework doesn’t account for:

    • Real-world AI models are not going to be literally perfectly aligned to the base objective. A more detailed analysis should consider non-exactly-aligned objectives, as well as objectives that are unaligned but also not deceptive, e.g. thanks to myopia.

    • Real-world models are not going to literally implement a pure simplicity bias.

    These subtleties do not seem to change the bottom line. I have already explained why I expect myopic models to eventually be outcompeted by agentic ones, as well as why I expect the combination of the simplicity prior with other ones to still result in a malign prior.”

  6. ^

    Skeptic: “To see this, take the following argument from Carlsmith’s report on scheming. Suppose that the model has 2^50 concepts in its world model that could in principle be turned into goals. The average number of bits required to code for each of 2^50 concepts can’t be higher than 50. So if we assume that the model’s encoding is reasonably efficient with respect to the average, then if we allocate one parameter per bit, pointing at the simplest non-schemer-like max-reward goal is only an extra 50 parameters at maximum. This is one twenty-billionth of a trillion-parameter model’s capacity. This argument obviously makes a bunch of simplifications and assumptions again, but I hope it conveys my intuition that by the time a very large model reaches post-training, the simplicity differences will have a trivial effect on the optimization process and there are several other factors that are going to matter way more.”
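    To spell out the arithmetic in the Skeptic’s example: encoding one of $2^{50}$ concepts takes at most $\log_2 2^{50} = 50$ bits on average, so at one parameter per bit the pointer costs at most 50 extra parameters, and

    $$\frac{50}{10^{12}} = 5 \times 10^{-11} = \frac{1}{2 \times 10^{10}},$$

    i.e. one twenty-billionth of a trillion-parameter model’s capacity.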