Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don’t have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition to pursue a goal which is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you’re just asking about math homework.
Aside: This was kinda a “holy shit” moment, and I’ll try to do it justice here. I encourage the reader to do a serious dependency check on their beliefs. What do you think you know about deceptive alignment being plausible, and why do you think you know it? Where did your beliefs truly come from, and do those observations truly provide
$$\frac{P(\text{observations}\mid\text{deceptive alignment is how AI works})}{P(\text{observations}\mid\text{deceptive alignment is not how AI works})} \gg 1\,?$$
I agree that conditional on entraining consequentialist cognition which has a “different goal” (as thought of by MIRI; this isn’t a frame I use), the AI will probably instrumentally reason about whether and how to deceptively pursue its own goals, to our detriment.
I contest that there’s very little reason to expect “undesired, covert, and consistent-across-situations inner goals” to crop up in LLMs to begin with. An example alternative prediction is:
LLMs will continue doing what they’re told. They learn contextual goal-directed capabilities, but only apply them narrowly, in certain contexts, for a range of goals (e.g. thinking about how to win a strategy game). They also memorize a lot of random data (instead of deriving some theory which simply explains their historical training data a la Solomonoff Induction).
Not only is this performant, it seems to be what we actually observe today. The AI can pursue goals when prompted to do so, but it isn’t pursuing them on its own. It basically follows instructions in a reasonable way, just like GPT-4 usually does.
Why should we believe the “consistent-across-situations inner goals → deceptive alignment” mechanistic claim about how SGD works? Here are the main arguments I’m aware of:
Analogies to evolution (e.g. page 6 of Risks from Learned Optimization).
I think these loose analogies provide basically no evidence about what happens in an extremely different optimization process (SGD to train LLMs).
Counting arguments: there are more unaligned goals than aligned goals (e.g. as argued in How likely is deceptive alignment?).
These ignore the importance of the parameter->function map. (They’re counting functions when they need to be counting parameterizations; see the toy sketch below.) Classical learning theory made the (mechanistically) same mistake in predicting that overparameterized models would fail to generalize.
I also basically deny the relevance of the counting argument, because I don’t buy the assumption of “there’s gonna be an inner ‘objective’ distinct from inner capabilities; let’s make a counting argument about what that will be.”
Speculation about simplicity bias: SGD will entrain consequentialism because that’s a simple algorithm for “getting low loss”.
But we already know that simplicity bias in the NN prior can be really hard to reason about.
I think it’s unrealistic to imagine that we have the level of theoretical precision to go “it’ll be a future training process and the model is ‘getting selected for low loss’, so I can now make this very detailed prediction about the inner mechanistic structure.”[1]
I falsifiably predict that if you try to use this kind of logic or counting argument today to make falsifiable predictions about unobserved LLM generalization, you’re going to lose Bayes points left and right.
Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homunculi whose goals have to be perfected.
Instead, I think that we enter the realm of tool AI[2] which basically does what you say.[3] I think that world’s a lot friendlier, even though there are still some challenges I’m worried about—like an AI being scaffolded into pursuing consistent goals. (I think that’s a very substantially different risk regime, though.)
[1] (Even though this predicted mechanistic structure doesn’t have any apparent manifestation in current reality.)
[2] Tool AI which can be purposefully scaffolded into agentic systems, which somewhat handles objections from Amdahl’s law.
[3] This is what we actually have today, in reality. In these setups, the agency comes from the system of subroutine calls to the LLM during e.g. a plan/critique/execute/evaluate loop a la AutoGPT.
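To make the “counting parameterizations, not functions” point concrete, here is a minimal toy sketch (my own illustration, with made-up sizes): in an overparameterized network, many distinct weight settings implement literally the same function, so counting functions of some type tells you little about how much parameter-space volume, and hence how much probability under random initialization plus SGD-style training, those functions actually get.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1-hidden-layer ReLU net: f(x) = W2 @ relu(W1 @ x)
d_in, d_hidden = 3, 8
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(1, d_hidden))

def net(x, W1, W2):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Permuting the hidden units gives a *different* point in parameter space
# that implements the *same* function. With d_hidden = 8 there are already
# 8! = 40320 such points for this one function, before even considering
# continuous rescaling symmetries or nearby weights with identical behavior.
perm = rng.permutation(d_hidden)
W1_p, W2_p = W1[perm], W2[:, perm]

x = rng.normal(size=d_in)
assert np.allclose(net(x, W1, W2), net(x, W1_p, W2_p))
print("same function, different parameters:", net(x, W1, W2))
```

This only exhibits exact symmetries; the real question is about volume, i.e. how much of parameter space maps onto behaviorally similar functions, which is exactly what a count over functions cannot tell you.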
I agree a fraction of the way, but when doing a dependency check, I feel like there are some conditions where the standard arguments go through.
I sketched out my view on the dependencies in “Where do you get your capabilities from?”. The TL;DR is that I think ChatGPT-style training basically consists of two different ways of obtaining capabilities:
Imitating internet text, which gains them capabilities to do the-sorts-of-things-humans-do because generating such text requires some such capabilities.
Reinforcement learning from human feedback on plans, where people evaluate the implications of the proposals the AI comes up with, and rate how good they are.
I think both of these are basically quite safe. They do have some issues, but probably not of the style usually discussed by rationalists working in AI alignment, and possibly not even issues going beyond any other technological development.
The basic principle for why they are safe-ish is that all of the capabilities they gain are obtained through human capabilities. So for example, while RLHF-on-plans may optimize for tricking human raters to the detriment of how the rater intended the plans to work out, this “tricking” will also sacrifice the capabilities of the plans, because the only reason more effective plans are rated better is because humans recognize their effectiveness and rate them better.
Or relatedly, consider nearest unblocked strategy, a common proposal for why alignment is hard. This only applies if the AI is able to consider an infinitude of strategies, which again only applies if it can generate its own strategies once the original human-generated strategies have been blocked.
Why should we believe the “consistent-across-situations inner goals → deceptive alignment” mechanistic claim about how SGD works? Here are the main arguments I’m aware of:
These are the main arguments that the rationalist community seems to be pushing about it. For instance, one time you asked about it, and LW just encouraged magical thinking around these arguments. (Even from people who I’d have thought would be clearer-thinking)
The non-magical way I’d analyze the smiling reward is that while people imagine that in theory updating the policy with antecedent-computation-reinforcement should make it maximize reward, in practice the information signal in this is very sparse and imprecise, so in practice it is going to take exponential time, and therefore something else will happen beforehand.
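One way to make the “exponential time” point concrete (a toy model I am supplying, not something from the original argument): suppose the reward is only delivered when a specific length-$n$ sequence of actions occurs, and exploration is essentially random over $|A|$ actions per step. Then

$$\Pr[\text{hit the rewarded sequence in one episode}] = |A|^{-n}, \qquad \mathbb{E}[\text{episodes until first reward}] = |A|^{n}.$$

So before the first reinforcement event ever happens, something else has already had an exponentially long time to shape the system.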
What is this “something else”? Probably something like: the human is going to reason about whether the AI’s activities are making progress on something desirable, and in those cases, the human is going to press the antecedent-computation-reinforcement button. Which again boils down to the AI copying the human’s capabilities, and therefore being relatively safe. (E.g. if you start seeing the AI studying how to deceive humans, you’re probably gonna punish it instead, or at least if you reward it then it more falls under the dual-use risk framework than the alignment risk framework.)
(Of course one could say “what if we just instruct the human to only reward the AI based on results and not incremental progress?”, but in that case the answer to what is going to happen before the AI does a treacherous turn is “the company training the AI runs out of money”.)
There’s something seriously wrong with how LWers fixated on reward-for-smiling and rationalized an explanation of why simplicity bias or similar would make this go into a treacherous turn.
OK, so this is how far I am with you. Rationalist stories of AI progress are basically very wrong, and some commonly endorsed threat models don’t point at anything serious. But what then?
The short technical answer is that reward is only antecedent-computation-reinforcement for policy-gradient-based reinforcement learning, and model-based approaches (traditionally temporal-difference learning, but I think the SOTA is DreamerV3, which is pretty cool) use the reward to learn a value function, which they then optimize in a more classical way, allowing them to create novel capabilities in precisely the way that can eventually lead to deception, treacherous turn, etc..
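To gesture at that distinction in code (a deliberately minimal sketch with a made-up bandit environment, not DreamerV3 or any real implementation): in the policy-gradient case, reward only scales the update to whatever computation happened to produce the sampled action, whereas in the model-based case, reward trains a model which the agent then explicitly optimizes against.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 5
true_reward = rng.normal(size=n_actions)  # stand-in "environment"

# (1) Policy gradient: reward reinforces the antecedent computation that
#     produced the sampled action; nothing is ever explicitly optimized.
logits = np.zeros(n_actions)
for _ in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(n_actions, p=probs)
    r = true_reward[a] + rng.normal(scale=0.1)
    grad = -probs
    grad[a] += 1.0              # d log pi(a) / d logits for a softmax policy
    logits += 0.1 * r * grad    # REINFORCE-style update

# (2) Model-based: fit a reward/value model from experience, then run an
#     explicit optimization (here just argmax) against the learned model.
reward_model = np.zeros(n_actions)
counts = np.zeros(n_actions)
for _ in range(500):
    a = rng.integers(n_actions)                    # explore
    r = true_reward[a] + rng.normal(scale=0.1)
    counts[a] += 1
    reward_model[a] += (r - reward_model[a]) / counts[a]
planned_action = int(np.argmax(reward_model))      # the "classical" optimization step

print(int(logits.argmax()), planned_action, int(true_reward.argmax()))
```

It is the second pattern, where “optimize the learned model” is an explicit internal step, that the deception and treacherous-turn stories are really about.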
One obvious proposal is “shit, let’s not do that, chatgpt seems like a pretty good and safe alternative, and it’s not like there’s any hype behind this anyway”. I’m not sure that proposal is right, because e.g. AlphaStar was pretty hype, and it was trained with one of these methods. But it sure does seem that there’s a lot of opportunity in ChatGPT now, so at least this seems like a directionally-correct update for a lot of people to make (e.g. stop complaining so much about the possibility of an even bigger GPT-5; it’s almost certainly safer to scale it up than it is to come up with algorithms that can improve model power without improving model scale).
However, I do think there are a handful of places where this falls apart:
Your question briefly mentioned “with clever exploration bonuses”. LW didn’t really reply much to it, but it seems likely that this could be the thing that does your question in. Maybe there are some model-free clever exploration bonuses, but if so, I have never heard of them. The most advanced exploration bonuses I have heard of are from the Dreamer line of models, which has precisely the sort of consequentialist reasoning abilities that start being dangerous.
My experience is that language models exhibit a phenomenon I call “transposons”, where (especially when fed back into themselves after deep levels of scaffolding) there are some ideas that they end up copying too much after prompting, clogging up the context window. I expect there will end up being strong incentives for people to come up with techniques to remove transposons, and I expect the most effective techniques will be based on some sort of consequences-based feedback system which again brings us back to essentially the original AI threat model.
I think security is going to be a big issue, along different lines: hostile nations, terrorists, criminals, spammers, trolls, competing companies, etc.. In order to achieve strong security, you need to be robust against adversarial attacks, which probably means continually coming up with new capabilities to fend them off. I guess one could imagine that humans will be coming up with those new capabilities, but that seems probably destroyed by adversaries using AIs to come up with security holes, and regardless it seems like having humans come up with the new capabilities will be extremely expensive, so probably people will focus on the AI side of things. To some extent, people might classify this as dual-use threats rather than alignment threats, but at least the arms race element would also generate alignment threats I think?
I think the way a lot of singularitarians thought of AI is that general agency consists of advanced adaptability and wisdom, and we expected that researchers would develop a lot of artificial adaptability, and then eventually the systems would become adaptable enough to generate wisdom faster than people do, and then overtake society. However what happened was that a relatively shallow form of adaptability was developed, and then people loaded all of human wisdom into that shallow adaptability, which turned out to be immensely profitable. But I’m not sure it’s going to continue to work this way; eventually we’re gonna run out of human wisdom to dump into the models, plus the models reduce the incentive for humans to share our wisdom in public. So just as a general principle, continued progress seems like it’s eventually going to switch to improving the ability to generate novel capabilities, which in turn is going to put us back to the traditional story? (Except possibly with a weirder takeoff? Slow takeoff followed by ??? takeoff or something. Idk.) Possibly this will be delayed because the creators of the LLMs find patches, e.g. I think we’re going to end up seeing the possibility of allowing people to “edit” the LLM responses rather than just doing upvotes/downvotes, but this again motivates some capabilities development because you need to filter out spammers/trolls/etc., which is probably best done through considering the consequences of the edits.
If we zoom out a bit, what’s going on here? Here’s some answers:
Conditioning on capabilities: There are strong reasons to expect capabilities to keep on improving, so we should condition on that. This leads to a lot of alignment issues, due to folk decision theory theorems. However, we don’t know how capabilities will develop, so we lose the ability to reason mechanistically when conditioning on capabilities, which means that there isn’t a mechanistic answer to all questions related to it. If one doesn’t keep track of this chain of reasoning, one might assume that there is a known mechanistic answer, which leads to the types of magical thinking seen in that other thread.
If people don’t follow-the-trying, that can lead to a lot of misattributions.
When considering agency in general, rather than consequentialist-based agency in particular, then instrumental convergence is more intuitive in disjunctive form than implication form. I.e. rather than “if an AI robustly achieves goals, then it will resist being shut down”, say “either an AI resists being shut down, or it doesn’t robustly achieve goals”.
Don’t trust rationalists too much.
One additional speculative thing I have been thinking of which is kind of off-topic and doesn’t fit in neatly with the other things:
Could there be “natural impact regularization” or “impact regularization by default”? Specifically, imagine you use a general-purpose search procedure which recursively invokes itself to solve subgoals for the purpose of solving some bigger goal.
If the search procedure’s solutions to subgoals “change things too much”, then they’re probably not going to be useful. E.g. for Rubik’s cubes, if you want to swap some of the cuboids, it does you no good if those swaps leave the rest of the cube scrambled.
Thus, to some extent, powerful capabilities would have to rely on some sort of impact regularization.
The bias-focused rationalists like to map decision-theoretic insights to human mistakes, but I instead like to map decision-theoretic insights to human capacities and experiences. I’m thinking that natural impact regularization is related to the notion of “elegance” in engineering. Like if you have some bloated tool to solve a problem, then even if it’s not strictly speaking an issue because you can afford the resources, it might feel ugly because it’s excessive and puts mild constraints on your other underconstrained decisions, and so on. Meanwhile a simple, minimal solution often doesn’t have this.
Natural impact regularization wouldn’t guarantee safety, since it still allows deviations that don’t interfere with the AI’s function, but it sort of reduces one source of danger which I had been thinking about lately. Namely, I had been thinking that the instrumental incentive is to search for powerful methods of influencing the world, where “power” connotes the sort of raw power that unstoppably forces a lot of change, but really the instrumental incentive is often to search for “precise” methods of influencing the world, where one can push in a lot of information to effect narrow change. (A complication is that any one agent can only have so much bandwidth. I’ve been thinking bandwidth is probably going to become a huge area of agent foundations, and that it’s been underexplored so far. (Perhaps because everyone working in alignment sucks at managing their bandwidth?))
Maybe another word for it would be “natural inner alignment”, since in a sense the point is that capabilities inevitably select for inner alignment.
I think deceptive alignment is still reasonably likely despite evidence from LLMs.
I agree with:
LLMs are not deceptively aligned and don’t really have inner goals in the sense that is scary
LLMs memorize a bunch of stuff
the kinds of reasoning that feed into deceptive alignment do not predict LLM behavior well
Adam on transformers does not have a super strong simplicity bias
without deceptive alignment, AI risk is a lot lower
LLMs not being deceptively aligned provides nonzero evidence against deceptive alignment (by conservation of evidence)
I predict I could pass the ITT for why LLMs are evidence that deceptive alignment is not likely.
however, I also note the following: LLMs are kind of bad at generalizing, and this makes them pretty bad at doing e.g. novel research or long-horizon tasks. deceptive alignment conditions on models already being better at generalization and reasoning than current models.
my current hypothesis is that future models which generalize in a way closer to that predicted by mesaoptimization will also be better described as having a simplicity bias.
I think this and other potential hypotheses can potentially be tested empirically today rather than only being distinguishable close to AGI
Note that “LLMs are evidence against this hypothesis” isn’t my main point here. The main claim is that the positive arguments for deceptive alignment are flimsy, and thus the prior is very low.
I think this and other potential hypotheses can potentially be tested empirically today rather than only being distinguishable close to AGI
How would you imagine doing this? I understand your hypothesis to be “If a model generalises as if it’s a mesa-optimiser, then it’s better-described as having simplicity bias”. Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL/inference-time planning and search/MCTS), and then trying to see if they tend to learn simple cross-episode inner goals which would be implied by a stronger simplicity bias?
I find myself unsure which conclusion this is trying to argue for.
Here are some pretty different conclusions:
Deceptive alignment is <<1% likely (quite implausible) to be a problem prior to complete human obsolescence (maybe it’s a problem after human obsolescence for our trusted AI successors, but who cares).
There aren’t any solid arguments for deceptive alignment[1]. So, we certainly shouldn’t be confident in deceptive alignment (e.g. >90%), though we can’t totally rule it out (prior to human obsolescence). Perhaps deceptive alignment is 15% likely to be a serious problem overall and maybe 10% likely to be a serious problem if we condition on fully obsoleting humanity via just scaling up LLM agents or similar (this is pretty close to what I think overall).
Deceptive alignment is <<1% likely for scaled up LLM agents (prior to human obsolescence). Who knows about other architectures.
There is a big difference between <<1% likely and 10% likely. I basically agree with “not much reason to expect deceptive alignment even in models which are behaviorally capable of implementing deceptive alignment”, but I don’t think this leaves me in a <<1% likely epistemic state.
Closest to the third, but I’d put it somewhere between .1% and 5%. I think 15% is way too high for some loose speculation about inductive biases, relative to the specificity of the predictions themselves.
There are some subskills to having consistent goals that I think will be selected for, at least when outcome-based RL starts working to get models to do long-horizon tasks. For example, the ability to not be distracted/nerdsniped into some different behavior by most stimuli while doing a task. The longer the horizon, the more selection—if you have to do a 10,000 step coding project, then the probability you get irrecoverably distracted on one step has to be below 1⁄10,000.
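A quick way to see where that kind of number comes from (my arithmetic, assuming an independent per-step derailment probability $p$):

$$\Pr[\text{finish all } n \text{ steps}] = (1-p)^{n} \approx e^{-np},$$

so for $n = 10{,}000$, even a 90% chance of finishing already requires $p \lesssim \ln(1/0.9)/10{,}000 \approx 10^{-5}$, i.e. comfortably below $1/n$.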
I expect some pretty sophisticated goal-regulation circuitry to develop as models get more capable, because humans need it, and this makes me pretty scared.
Instead, we enter the realm of tool AI which basically does what you say.
I agree that, conditional on no deceptive alignment, the most pernicious and least tractable sources of doom go away.
However, I disagree that conditional on no deceptive alignment, AI “basically does what you say.” Indeed, the majority of my P(doom) comes from the difference between “looks good to human evaluators” and “is actually what the human evaluators wanted.” Concretely, this could play out with models which manipulate their users into thinking everything is going well and sensor tamper.
I think current observations don’t provide much evidence about whether these concerns will pan out: with current models and training set-ups, “looks good to evaluators” almost always coincides with “is what evaluators wanted.” I worry that we’ll only see this distinction matter once models are smart enough that they could competently deceive their overseers if they were trying (because of the same argument made here). (Forms of sycophancy where models knowingly assert false statements when they expect the user will agree are somewhat relevant, but there are also benign reasons models might do this.)
I think my crux is that once we remove the deceptive alignment issue, I suspect that profit forces alone will generate a large incentive to reduce the gap, primarily because I think that people way overestimate the value of powerful, unaligned agents to the market and underestimate the want for control over AI.
I contest that there’s very little reason to expect “undesired, covert, and consistent-across-situations inner goals” to crop up in [LLMs as trained today] to begin with
As someone who consider deceptive alignment a concern: fully agree. (With the caveat, of course, that it’s because I don’t expect LLMs to scale to AGI.)
I think there’s in general a lot of speaking-past-each-other in alignment, and what precisely people mean by “problem X will appear if we continue advancing/scaling” is one of them.
Like, of course a new problem won’t appear if we just keep doing the exact same thing that we’ve already been doing. Except “the exact same thing” is actually some equivalence class of approaches/architectures/training processes, but which equivalence class people mean can differ.
For example:
Person A, who’s worried about deceptive alignment, can have “scaling LLMs arbitrarily far” defined as this proven-safe equivalence class of architectures. So when they say they’re worried about capability advancement bringing in new problems, what they mean is “if we move beyond the LLM paradigm, deceptive alignment may appear”.
Person B, hearing the first one, might model them as instead defining “LLMs trained with N amount of compute” as the proven-safe architecture class, and so interpret their words as “if we keep scaling LLMs beyond N, they may suddenly develop this new problem”. Which, on B’s model of how LLMs work, may seem utterly ridiculous.
And the tricky thing is that Person A likely equates the “proven-safe” architecture class with the “doesn’t scale to AGI” architecture class – so they actually expect the major AI labs to venture outside that class, the moment they realize its limitations. Person B, conversely, might disagree, might think those classes are different, and that safely limited models can scale to AGI/interesting capabilities. (As a Person A in this situation, I think Person B’s model of cognition is confused; but that’s a different topic.)
Which is all important disconnects to watch out for.
(Uh, caveat: I think some people actually are worried about scaled-up LLMs exhibiting deceptive alignment, even without architectural changes. But I would delineate this cluster of views from the one I put myself in, and which I outline there. And, likewise, I expect that the other people I would tentatively classify as belonging to this cluster – Eliezer, Nate, John – mostly aren’t worried about just-scaling-the-LLMs to be leading to deceptive alignment.)
I think it’s important to put more effort into tracking such definitional issues, though. People end up overstating things because they round off their interlocutors’ viewpoint to their own. For instance if person C asks “is it safe to scale generative language pre-training and ChatGPT-style DPO arbitrarily far?”, when person D then rounds this off to “is it safe to make transformer-based LLMs as powerful as possible?” and explains that “no, because instrumental convergence and compression priors”, this is probably just false for the original meaning of the statement.
If this repeatedly happens to the point of generating a consensus for the false claim, then that can push the alignment community severely off track.
LLMs will soon scale beyond the available natural text data, and generation of synthetic data is some sort of change of architecture, potentially a completely different source of capabilities. So scaling LLMs without change of architecture much further is an expectation about something counterfactual. It makes sense as a matter of theory, but it’s not relevant for forecasting.
The main result is that up to 4 repetitions are about as good as unique data, and for up to about 16 repetitions there is still meaningful improvement. Let’s take 50T tokens as an estimate for available text data (as an anchor, there’s a filtered and deduplicated CommonCrawl dataset RedPajama-Data-v2 with 30T tokens). Repeated 4 times, it can make good use of 1e28 FLOPs (with a dense transformer), and repeated 16 times, suboptimal but meaningful use of 2e29 FLOPs. So this is close to, but not lower than, what can be put to use within a few years.
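Spelling out the arithmetic behind those FLOP figures (my own back-of-envelope, using the rough Chinchilla rules of thumb $C \approx 6ND$ and $D \approx 20N$, so $C \approx 0.3\,D^2$; the exact constants vary by fit):

```python
def chinchilla_compute_from_tokens(tokens: float) -> float:
    """Rough training compute (FLOPs) a Chinchilla-optimal dense model can use
    for a given token budget: D = 20*N and C = 6*N*D imply C = 0.3 * D**2."""
    return 0.3 * tokens ** 2

unique_tokens = 50e12  # 50T tokens of available text

# 4 repetitions treated as ~full value (the "about as good as unique data" regime)
print(f"{chinchilla_compute_from_tokens(4 * unique_tokens):.1e} FLOPs")   # ~1.2e28
# 16 repetitions treated as full value; in reality this regime is suboptimal,
# so read the result as a loose bound on "meaningful use"
print(f"{chinchilla_compute_from_tokens(16 * unique_tokens):.1e} FLOPs")  # ~1.9e29
```

which roughly reproduces the 1e28 and 2e29 figures above.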
Thanks for pushing back on the original claim.
Three points: how much compute is going into a training run, how much natural text data it wants, and how much data is available.

For training compute, there are claims of multi-billion dollar runs being plausible and possibly planned in 2-5 years. Eyeballing various trends and GPU shipping numbers and revenues, it looks like about 3 OOMs of compute scaling is possible before industrial capacity constrains the trend and the scaling slows down. This assumes that there are no overly dramatic profits from AI (which might lead to finding ways of scaling supply chains faster than usual), and no overly dramatic lack of new capabilities with further scaling (which would slow down investment in scaling). That gives about 1e28-1e29 FLOPs at the slowdown in 4-6 years.

At 1e28 FLOPs, Chinchilla scaling asks for 200T-250T tokens. Various sparsity techniques increase effective compute, asking for even more tokens (when optimizing loss given fixed hardware compute). Edit 15 Dec: I no longer endorse this point, based on scaling laws for training on repeated data.

On the outside, there are 20M-150M accessible books, some text from video, and 1T web pages of extremely dubious uniqueness and quality. That might give about 100T tokens, if LLMs are used to curate? There’s some discussion (incl. comments) here; this is the figure I’m most uncertain about.

In practice, absent good synthetic data, I expect multimodality to fill the gap, but that’s not going to be as useful as good text for improving chatbot competence.

(Possibly the issue with the original claim in the grandparent is what I meant by “soon”.)
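For the 200T-250T figure, the same rule of thumb run in the other direction (again my arithmetic; the fitted Chinchilla constants vary, which is roughly where the gap between ~180T and 200T-250T comes from):

```python
import math

def chinchilla_tokens_from_compute(flops: float) -> float:
    """Token demand of a Chinchilla-optimal dense run: C = 0.3 * D**2, so D = sqrt(C / 0.3)."""
    return math.sqrt(flops / 0.3)

print(f"{chinchilla_tokens_from_compute(1e28):.1e} tokens")  # ~1.8e14, i.e. roughly 180T
```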
I’m not sure how much I expect something like deceptive alignment from just scaling up LLMs. My guess would be that in some limit this gets AGI and also something deceptively aligned by default, but in reality we end up taking a shorter path to AGI, which also leads to deceptive alignment by default. However, I can think about the LLM AGI in this comment, and I don’t think it changes much.
The main reason I expect something like deceptive alignment is that we are expecting the thing to actually be really powerful, and so it actually needs to be coherent over both long sequences of actions and into new distributions it wasn’t trained on. It seems pretty unlikely to end up in a world where we train for the AI to act aligned for one distribution and then it generalizes exactly as we would like in a very different distribution. Or at least that it generalizes the things that we care about it generalizing, for example:
Don’t seek power, but do gain resources enough to do the difficult task
Don’t manipulate humans, but do tell them useful things and get them to sign off on your plans
I don’t think I agree to your counters to the specific arguments about deceptive alignment:
For Quintin’s post, I think Steve Byrnes’ comment is a good approximation of my views here
I don’t fully know how to think about the counting arguments when things aren’t crisply divided, and I do agree that the LLM AGI would likely be a big mess with no separable “world model”, “objective”, or “planning engine”. However, it really isn’t clear to me that this makes the case weaker; the AI will be doing something powerful (by assumption), and it’s not clear that it will do the powerful thing that we want.
I really strongly agree that the NN prior can be really hard to reason about. It really seems to me like this again makes the situation worse; we might have a really hard time thinking about how the big powerful AI will generalize when we ask it to do powerful stuff.
I think a lot of this comes from some intuition like “Condition on the AI being powerful enough to do the crazy stuff we’ll be asking of an AGI. If it is this capable but doesn’t generalize the task in the exact way you want then you get something that looks like deceptive alignment.”
Not being able to think about the NN prior really just seems to make things harder, because we don’t know how it will generalize things we tell it.
Conditional on the AI never doing something like: manipulating/deceiving[1] the humans such that the humans think the AI is aligned, such that the AI can later do things the humans don’t like, then I am much more optimistic about the whole situation.
I think a lot of this probably comes back to way overestimating the complexity of human values. I think a very deeply held belief of a lot of LWers is that human values are intractably complicated and gene/societal-specific, and I think if this was the case, the argument would actually be a little concerning, as we’d have to rely on massive speed biases to punish deception.
These posts gave me good intuition for why human value is likely to be quite simple. One of them talks about how most of the complexity of our values is inaccessible to the genome, and thus values must start from far less complexity than people realize, because nearly all of it needs to be learned. Some other posts from Steven Byrnes are relevant, which talk about how simple the brain is. A potential difference between me and Steven Byrnes is that I think the same learning-from-scratch algorithms that generate capabilities also apply to values, and thus the complexity of value is upper-bounded by the complexity of the learning-from-scratch algorithms + genetic priors, both of which are likely very low: at the very least not billions of lines complex, and closer to thousands of lines/hundreds of bits.
But the reason this matters is because we no longer have good reason to assume that the deceptive model is so favored on priors like Evan Hubinger says here, as the complexity is likely massively lower than LWers assume.
Putting it another way: the deceptive and aligned models have very similar complexities, so the relative difficulty gap is very small. The aligned model might even be outright lower complexity; but even if that fails, the desired goal has a complexity very similar to the undesired goal’s, so the relative difficulty of actual alignment compared to deceptive alignment is quite low.
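To state that in the simplicity-prior language these arguments use (a rough formalization I am supplying, not one taken from the linked posts): under a description-length prior, the odds between the deceptive and aligned models scale like

$$\frac{P(\text{deceptive})}{P(\text{aligned})} \sim 2^{\,K(\text{aligned}) - K(\text{deceptive})},$$

so if the two complexities differ by only a handful of bits, the prior favors neither side strongly, and the “deception is heavily favored on priors” conclusion doesn’t go through.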
My point here is that, even conditional on the frame being correct, there are a lot of assumptions like “value is complicated” that I don’t buy, and a lot of these assumptions have a good chance of being false, which significantly impacts the downstream conclusions. That matters because a lot of LWers probably either hold these beliefs or assume them tacitly in arguments that alignment is hard.
Many people seem to expect that reward will be the optimization target of really smart learned policies—that these policies will be reward optimizers. I strongly disagree. As I argue in this essay, reward is not, in general, that-which-is-optimized by RL agents. [...] Reward chisels cognitive grooves into an agent.
I do think that sufficiently sophisticated[1] RL policies trained on a predictable environment with a simple and consistent reward scheme probably will develop an internal model of the thing being rewarded, as a single salient thing, and separately that some learned models will learn to make longer-term-than-immediate predictions about the future. So as such I do expect “iterate through some likely actions, and choose one where the reward proxy is high” will at some point emerge as an available strategy for RL policies[2].
My impression is that it’s an open question to what extent that available strategy is a better-performing strategy than a more sphexish pile of “if the environment looks like this, execute that behavior” heuristics, given a fixed amount of available computational power. In the limit as the system’s computational power approaches infinite and the accuracy of its predictions about future world states approaches perfection, the argmax(EU) strategy gets reinforced more strongly than any other strategy, and so that ends up being what gets chiseled into the model’s cognition. But of course in that limit “just brute-force sha256 bro” is an optimal strategy in certain situations, so the extent to which the “in the limit” behavior resembles the “in the regimes we actually care about” behavior is debatable.
If I’m reading Ability to Solve Long Horizon Tasks Correlates With Wanting correctly, that post argues that you can’t get good performance on any task where the reward is distant in time from the actions unless your system is doing something like this.
I think there are two separate questions here, with possibly (and I suspect actually) very different answers:
How likely is deceptive alignment to arise in an LLM under SGD across a large very diverse pretraining set (such as a slice of the internet)?
How likely is deceptive alignment to be boosted in an LLM under SGD fine tuning followed by RL for HHH-behavior applied to a base model trained by 1.?
I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything that humans do that affects the contents of the Internet. One of the many things that humans do is pretend to be more aligned to the wishes of an authority that has power over them than they truly are. So for a large enough LLM, SGD will create a world model for this behavior along with thousands of other human behaviors, and the LLM will (depending on the prompt) tend to activate this behavior at about the frequency and level that you find it on the Internet, as modified by cues in the particular prompt. On the Internet, this is generally a mild background level for people writing while at work in Western countries, and probably more strongly for people writing from more authoritarian countries: specific prompts will be more or less correlated with this.
For 2., the question is whether fine-tuning followed by RL will settle on this preexisting mechanism and make heavy use of it as part of the way that it implements something that fits the fine-tuning set/scores well on the reward model aimed at creating a helpful, honest, and harmless assistant persona. I’m a lot less certain of the answer here, and I suspect it might depend rather strongly on the details of the training set. For example, is this evoking an “you’re at work, or in an authoritarian environment, so watch what you say and do” scenario that might boost the use of this particular behavior? The “harmless” element in HHH seems particularly concerning here: it suggests an environment in which certain things can’t be discussed, which tend to be the sorts of environments that evince this behavior more strongly in humans.
Two quick thoughts (that don’t engage deeply with this nice post).
I’m worried about some cases where the goal is not consistent across situations. For example, if prompted to pursue some goal, the model then pursues it seriously, with convergent instrumental goals.
I think it seems pretty likely that future iterations of transformers will have bits of powerful search in them, but people who seem very worried about that search seem to think that once that search is established enough, gradient descent will cause the internals of the model to be organised mostly around that search (I imagine the search circuits “bubbling out” to be the outer structure of the learned algorithm). Probably this is all just conceptually confused, but to the extent it’s not, I’m pretty surprised by their intuition.
To me, it seems that consequentialism is just a really good algorithm to perform very well on a very wide range of tasks. Therefore, for difficult enough tasks, I expect that consequentialism is the kind of algorithm that would be found because it’s the kind of algorithm that can perform the task well.
When we start training the system we have a random initialization. Let’s make the following simplification: we have a goal in the system somewhere, and we have the consequential reasoning algorithm in the system somewhere. As we train the system, the consequential reasoning will get better and better, and the goal will get more and more aligned with the outer objective, because both of these things improve performance. However, there will come a point in training where the consequential reasoning algorithm is good enough to realize that it is in training. And then bad things start to happen. It will try to figure out the outer objective and optimize for it directly. SGD will incentivize this kind of behavior because it performs better than not doing it.
There really is a lot more to this kind of argument. So far I have failed to write it up. I hope the above is enough to hint at why I think that it is possible that being deceptive is just better than not being deceptive in terms of performance. When you become deceptive, that aligns the consequentialist reasoner faster to the outer objective, compared to waiting for SGD to gradually correct everything. In fact it is sort of a constant boost to your performance to be deceptive, even before the system has become very good at optimizing for the true objective. A really good consequential reasoner could probably just get zero loss immediately by retargeting its consequential reasoner instead of waiting for the goal to be updated to match the outer objective by SGD, as soon as it got a perfect model of the outer objective.
I’m not sure that deception is a problem. Maybe it is not. But to me, it really seems like you don’t provide any reason that makes me confident that it won’t be a problem. It seems very strange to me to argue that this is not a thing because existing arguments are flimsy. I mean, you did not even address my argument. It’s like trying to prove that all bananas are green by showing me 3 green bananas.
TL;DR: I think you’re right that much inner alignment conversation is overly-fanciful. But ‘LLMs will do what they’re told’ and ‘deceptive alignment is unjustified’ are non sequiturs. Maybe we mean something different by ‘LLM’? Either way, I think there’s a case for ‘inner homunculi’ after all.
I have always thought that the ‘inner’ conversation is missing something. On the one hand it’s moderately-clearly identifying a type of object, which is a point in favour. Further, as you’ve said yourself, ‘reward is not the optimisation target’ is very obvious but it seems to need spelling out repeatedly (a service the ‘inner/outer’ distinction provides). On the other hand the ‘inner alignment’ conversation seems in practice to distract from the actual issue which is ‘some artefacts are (could be) doing their own planning/deliberation/optimisation’, and ‘inner’ is only properly pointing at a subset of those. (We can totally build, including accidentally, artefacts which do this ‘outside’ the weights of NN.)
You’ve indeed pointed at a few of the more fanciful parts of that discussion here[1], like steganographic gradient hacking. NB steganography per se isn’t entirely wild; we see ML systems use available bandwidth to develop unintuitive communication protocols a lot e.g. in MARL.
I can only assume you mean something very narrow and limited by ‘continuing to scale up LLMs’[2]. I think without specifying what you mean, your words are likely to be misconstrued.
With that said, I think something not far off the ‘inner consequentialist’ is entirely plausible and consistent with observations. In short, how do plan-like outputs emerge without a planning-like computation?[3] I’d say ‘undesired, covert, and consistent-across-situations inner goals’ is a weakman of deceptive alignment. Specialising to LLMs, the point is that they’ve exhibited poorly-understood somewhat-general planning capability, and that not all ways of turning that into competent AI assistants[4] result in aligned planners. Planners can be deceptive (and we have evidence of this).
I think your step from ‘there’s very little reason to expect “undesired, covert, and consistent-across-situations inner goals”...’ to ‘LLMs will continue doing what they’re told...’ is therefore a boldly overconfident non sequitur. (I recognise that the second step there is presented as ‘an alternative’, so it’s not clear whether you actually think that.)
Incidentally, I’d go further and say that, to me (but I guess you disagree?) it’s at least plausible that some ways of turning LLM planning capability into a competent AI assistant actually do result in ‘undesired, covert, and consistent-across-situations inner goals’! Sketch in thread.
FWIW I’m more concerned about highly competent planning coming about by one or another kind of scaffolding, but such a system can just as readily be deceptively aligned as a ‘vanilla’ LLM. If enough of the planning is externalised and understandable and actually monitored in practice, this might be easier to catch.
I’m apparently more sympathetic to fancy than you; in absence of mechanistic observables and in order to extrapolate/predict, we have to apply some imagination. But I agree there’s a lot of fancy; some of it even looks like hallucination getting reified by self-conditioning(!), and that effort could probably be more efficiently spent.
I’m presently (quite badly IMO) trying to anticipate the shape of the next big step in get-things-done/autonomy.
I’ve had a hunch for a while that temporally abstract planning and prediction is key. I strongly suspect you can squeeze more consequential planning out of shortish serial depth than most people give credit for. This is informed by past RL-flavoured stuff like MuZero and its limitations, by observations of humans and animals (inc myself), and by general CS/algos thinking. Actually this is where I get on the LLM train. It seems to me that language is an ideal substrate for temporally abstract planning and prediction, and lots of language data in the wild exemplifies this. NB I don’t think GPTs or LLMs are uniquely on this trajectory, just getting a big bootstrap.
Now, if I had to make the most concrete ‘inner homunculus’ case off the cuff, I’d start in the vicinity of Good Regulator, except a more conjectury version regarding systems-predicting-planners (I am working on sharpening this). Maybe I’d point at Janus’ Simulators post. I suspect there might be something like an impossibility/intractability theorem for predicting planners of the right kind without running a planner of a similar kind. (Handwave!)
I’d observe that GPTs can predict planning-looking actions, including sometimes without CoT. (NOTE here’s where the most concrete and proximal evidence is!) This includes characters engaging in deceit. I’d invoke my loose reasoning regarding temporal abstraction to support the hypothesis that this is ‘more than mere parroting’, and maybe fish for examples quite far from obvious training settings to back this up. Interp would be super, of course! (Relatedly, some of your work on steering policies via activation editing has sparked my interest.)
I think maybe this is enough to transfer some sense of what I’m getting at? At this point, given some (patchy) theory, the evidence is supportive of (among other hypotheses) an ‘inner planning’ hypothesis (of quite indeterminate form).
Finally, one kind or another of ‘conditioning’ is hypothesised to reinforce the consequentialist component(s) ‘somehow’ (handwave again, though I’m hardly the only one guilty of handwaving about RLHF et al). I think it’s appropriate to be uncertain what form the inner planning takes, what form the conditioning can/will take, and what the eventual results of that are. Interested in evidence and theory around this area.
So, what are we talking about when we say ‘LLM’? Plain GPT? Well, they definitely don’t ‘do what they’re told’[1]. They exhibit planning-like outputs with the right prompts, typically associated with ‘simulated characters’ at some resolution or other. What about RLHFed GPTs? Well, they sometimes ‘do what they’re told’. They also exhibit planning-like outputs with the right prompts, and it’s mechanistically very unclear how they’re getting them.
unless you mean predicting the next token (I’m pretty sure you don’t mean this?), which they do quite well, though we don’t know how, nor when it’ll fail
Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don’t have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition to pursue a goal which is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you’re just asking about math homework.
Aside: This was kinda a “holy shit” moment, and I’ll try to do it justice here. I encourage the reader to do a serious dependency check on their beliefs. What do you think you know about deceptive alignment being plausible, and why do you think you know it? Where did your beliefs truly come from, and do those observations truly provide
P(observations∣deceptive alignment is how AI works)P(observations∣deceptive alignment is not how AI works)>>1?I agree that conditional on entraining consequentialist cognition which has a “different goal” (as thought of by MIRI; this isn’t a frame I use), the AI will probably instrumentally reason about whether and how to deceptively pursue its own goals, to our detriment.
I contest that there’s very little reason to expect “undesired, covert, and consistent-across-situations inner goals” to crop up in LLMs to begin with. An example alternative prediction is:
Not only is this performant, it seems to be what we actually observe today. The AI can pursue goals when prompted to do so, but it isn’t pursuing them on its own. It basically follows instructions in a reasonable way, just like GPT-4 usually does.
Why should we believe the “consistent-across-situations inner goals → deceptive alignment” mechanistic claim about how SGD works? Here are the main arguments I’m aware of:
Analogies to evolution (e.g. page 6 of Risks from Learned Optimization)
I think these loose analogies provide basically no evidence about what happens in an extremely different optimization process (SGD to train LLMs).
Counting arguments: there are more unaligned goals than aligned goals (e.g. as argued in How likely is deceptive alignment?)
These ignore the importance of the parameter->function map. (They’re counting functions when they need to be counting parameterizations.) Classical learning theory made the (mechanistically) same mistake in predicting that overparameterized models would fail to generalize.
I also basically deny the relevance of the counting argument, because I don’t buy the assumption of “there’s gonna be an inner ‘objective’ distinct from inner capabilities; let’s make a counting argument about what that will be.”
Speculation about simplicity bias: SGD will entrain consequentialism because that’s a simple algorithm for “getting low loss”
But we already know that simplicity bias in the NN prior can be really hard to reason about.
I think it’s unrealistic to imagine that we have the level of theoretical precision to go “it’ll be a future training process and the model is ‘getting selected for low loss’, so I can now make this very detailed prediction about the inner mechanistic structure.”[1]
I falsifiably predict that if you try to use this kind of logic or counting argument today to make falsifiable predictions about unobserved LLM generalization, you’re going to lose Bayes points left and right.
Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homonculi whose goals have to be perfected.
Instead, I think that we enter the realm of tool AI[2] which basically does what you say.[3] I think that world’s a lot friendlier, even though there are still some challenges I’m worried about—like an AI being scaffolded into pursuing consistent goals. (I think that’s a very substantially different risk regime, though)
(Even though this predicted mechanistic structure doesn’t have any apparent manifestation in current reality.)
Tool AI which can be purposefully scaffolded into agentic systems, which somewhat handles objections from Amdahl’s law.
This is what we actually have today, in reality. In these setups, the agency comes from the system of subroutine calls to the LLM during e.g. a plan/critique/execute/evaluate loop a la AutoGPT.
I agree a fraction of the way, but when doing a dependency check, I feel like there are some conditions where the standard arguments go through.
I sketched out my view on the dependencies Where do you get your capabilities from?. The TL;DR is that I think ChatGPT-style training basically consists of two different ways of obtaining capabilities:
Imitating internet text, which gains them capabilities to do the-sorts-of-things-humans-do because generating such text requires some such capabilities.
Reinforcement learning from human feedback on plans, where people evaluate the implications of the proposals the AI comes up with, and rate how good they are.
I think both of these are basically quite safe. They do have some issues, but probably not of the style usually discussed by rationalists working in AI alignment, and possibly not even issues going beyond any other technological development.
The basic principle for why they are safe-ish is that all of the capabilities they gain are obtained through human capabilities. So for example, while RLHF-on-plans may optimize for tricking human raters to the detriment of how the rater intended the plans to work out, this “tricking” will also sacrifice the capabilities of the plans, because the only reason more effective plans are rated better is because humans recognize their effectiveness and rate them better.
Or relatedly, consider nearest unblocked strategy, a common proposal for why alignment is hard. This only applies if the AI is able to consider an infinitude of strategies, which again only applies if it can generate its own strategies once the original human-generated strategies have been blocked.
These are the main arguments that the rationalist community seems to be pushing about it. For instance, one time you asked about it, and LW just encouraged magical thinking around these arguments. (Even from people who I’d have thought would be clearer-thinking)
The non-magical way I’d analyze the smiling reward is that while people imagine that in theory updating the policy with antecedent-computation-reinforcement should make it maximize reward, in practice the information signal in this is very sparse and imprecise, so in practice it is going to take exponential time, and therefore something else will happen beforehand.
What is this “something else”? Probably something like: the human is going to reason about whether the AI’s activities are making progress on something desirable, and in those cases, the human is going to press the antecedent-computation-reinforcement button. Which again boils down to the AI copying the human’s capabilities, and therefore being relatively safe. (E.g. if you start seeing the AI studying how to deceive humans, you’re probably gonna punish it instead, or at least if you reward it then it more falls under the dual-use risk framework than the alignment risk framework.)
(Of course one could say “what if we just instruct the human to only reward the AI based on results and not incremental progress?”, but in that case the answer to what is going to happen before the AI does a treacherous turn is “the company training the AI runs out of money”.)
There’s something seriously wrong with how LWers fixated on reward-for-smiling and rationalized an explanation of why simplicity bias or similar would make this go into a treacherous turn.
OK, so this is how far I’m with you. Rationalist stories of AI progress are basically very wrong, and some commonly endorsed threat models don’t point at anything serious. But what then?
The short technical answer is that reward is only antecedent-computation-reinforcement for policy-gradient-based reinforcement learning, and model-based approaches (traditionally temporal-difference learning, but I think the SOTA is DreamerV3, which is pretty cool) use the reward to learn a value function, which they then optimize in a more classical way, allowing them to create novel capabilities in precisely the way that can eventually lead to deception, treacherous turn, etc..
One obvious proposal is “shit, let’s not do that, chatgpt seems like a pretty good and safe alternative, and it’s not like there’s any hype behind this anyway”. I’m not sure that proposal is right, because e.g. AlphaStar was pretty hype, and it was trained with one of these methods. But it sure does seem that there’s a lot of opportunity in ChatGPT now, so at least this seems like a directionally-correct update for a lot of people to make (e.g. stop complaining so much about the possibility of an even bigger GPT-5; it’s almost certainly safer to scale it up than it is to come up with algorithms that can improve model power without improving model scale).
However, I do think there are a handful of places where this falls apart:
Your question briefly mentioned “with clever exploration bonuses”. LW didn’t really reply much to it, but it seems likely that this could be the thing that does your question in. Maybe there’s some model-free clever exploration bonuses, but if so I have never heard of them. The most advanced exploration bonuses I have heard of are from the Dreamer line of models, and it has precisely the sort of consequentialist reasoning abilities that start being dangerous.
My experience is that language models exhibit a phenomenon I call “transposons”, where (especially when fed back into themselves after deep levels of scaffolding) there are some ideas that they end up copying too much after prompting, clogging up the context window. I expect there will end up being strong incentives for people to come up with techniques to remove transposons, and I expect the most effective techniques will be based on some sort of consequences-based feedback system which again brings us back to essentially the original AI threat model.
I think security is going to be a big issue, along different lines: hostile nations, terrorists, criminals, spammers, trolls, competing companies, etc.. In order to achieve strong security, you need to be robust against adversarial attacks, which probably means continually coming up with new capabilities to fend them off. I guess one could imagine that humans will be coming up with those new capabilities, but that seems probably destroyed by adversaries using AIs to come up with security holes, and regardless it seems like having humans come up with the new capabilities will be extremely expensive, so probably people will focus on the AI side of things. To some extent, people might classify this as dual-use threats rather than alignment threats, but at least the arms race element would also generate alignment threats I think?
I think the way a lot of singularitarians thought of AI is that general agency consists of advanced adaptability and wisdom, and we expected that researchers would develop a lot of artificial adaptability, and then eventually the systems would become adaptable enough to generate wisdom faster than people do, and then overtake society. However what happened was that a relatively shallow form of adaptability was developed, and then people loaded all of human wisdom into that shallow adaptability, which turned out to be immensely profitable. But I’m not sure it’s going to continue to work this way; eventually we’re gonna run out of human wisdom to dump into the models, plus the models reduce the incentive for humans to share our wisdom in public. So just as a general principle, continued progress seems like it’s eventually going to switch to improving the ability to generate novel capabilities, which in turn is going to put us back to the traditional story? (Except possibly with a weirder takeoff? Slow takeoff followed by ??? takeoff or something. Idk.) Possibly this will be delayed because the creators of the LLMs find patches, e.g. I think we’re going to end up seeing the possibility of allowing people to “edit” the LLM responses rather than just doing upvotes/downvotes, but this again motivates some capabilities development because you need to filter out spammers/trolls/etc., which is probably best done through considering the consequences of the edits.
If we zoom out a bit, what’s going on here? Here’s some answers:
Conditioning on capabilities: There are strong reasons to expect capabilities to keep on improving, so we should condition on that. This leads to a lot of alignment issues, due to folk decision theory theorems. However, we don’t know how capabilities will develop, so we lose the ability to reason mechanistically when conditioning on capabilities, which means that there isn’t a mechanistic answer to all questions related to it. If one doesn’t keep track of this chain of reasoning, one might assume that there is a known mechanistic answer, which leads to the types of magical thinking seen in that other thread.
If people don’t follow-the-trying, that can lead to a lot of misattributions.
When considering agency in general, rather than consequentialist-based agency in particular, instrumental convergence is more intuitive in disjunctive form than in implication form. I.e. rather than “if an AI robustly achieves goals, then it will resist being shut down”, say “either an AI resists being shut down, or it doesn’t robustly achieve goals” (spelled out as a formula just after this list).
Don’t trust rationalists too much.
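Spelling out that equivalence explicitly (my notation, not the commenter’s): write R for “the AI robustly achieves goals” and S for “the AI resists being shut down”; the implication and the disjunctive phrasing are logically the same claim:

$$(R \rightarrow S) \;\Longleftrightarrow\; (S \lor \lnot R)$$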
One additional speculative thing I have been thinking of which is kind of off-topic and doesn’t fit in neatly with the other things:
Could there be “natural impact regularization” or “impact regularization by default”? Specifically, imagine you use a general-purpose search procedure which recursively invokes itself to solve subgoals in the service of some bigger goal.
If the search procedure’s solutions to subgoals “change things too much”, then they’re probably not going to be useful. E.g. for Rubik’s cubes, if you want to swap some of the cuboids, it does you no good if those swaps leave the rest of the cube scrambled.
Thus, to some extent, powerful capabilities would have to rely on some sort of impact regularization.
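As a toy illustration of that point (entirely my own construction, not something from the comment): a parent search that solves one subgoal at a time and simply prefers whichever subgoal solution disturbs the rest of the state least will tend to preserve progress on the other subgoals, without any explicit impact penalty in the outer objective. The state, subgoals, and candidate generator below are all made up for the sketch.

```python
# Toy sketch of "natural impact regularization": among candidate solutions to a
# subgoal, the parent search picks the one with the fewest side effects, because
# low-side-effect solutions compose better toward the overall goal.
import random

random.seed(0)
STATE_LEN = 8
GOAL = tuple(range(STATE_LEN))  # overall goal: fully "solved" state

def solve_subgoal(state, index, value):
    """Return candidate next-states that all satisfy state[index] == value,
    but differ in how much they scramble the other positions."""
    candidates = []
    for _ in range(5):
        s = list(state)
        s[index] = value
        for j in random.sample(range(STATE_LEN), random.randint(0, STATE_LEN - 1)):
            if j != index:
                s[j] = random.randrange(STATE_LEN)
        candidates.append(tuple(s))
    return candidates

def side_effects(before, after, index):
    """Count positions other than the subgoal's own that changed."""
    return sum(1 for j in range(STATE_LEN) if j != index and before[j] != after[j])

def run(pick):
    state = tuple(reversed(range(STATE_LEN)))  # start scrambled
    for i, v in enumerate(GOAL):               # solve one subgoal at a time
        state = pick(state, solve_subgoal(state, i, v), i)
    return sum(state[j] == GOAL[j] for j in range(STATE_LEN))

low_impact = lambda s, cands, i: min(cands, key=lambda c: side_effects(s, c, i))
indifferent = lambda s, cands, i: random.choice(cands)

print("goal positions solved, low-impact picks:  ", run(low_impact))
print("goal positions solved, impact-blind picks:", run(indifferent))
```

The low-impact picker typically ends up with far more of the overall goal satisfied, even though nothing in its objective mentions impact; the regularization falls out of wanting the subgoal solutions to be useful to the parent search.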
The bias-focused rationalists like to map decision-theoretic insights onto human mistakes, but I instead like to map decision-theoretic insights onto human capacities and experiences. I’m thinking that natural impact regularization is related to the notion of “elegance” in engineering. Like if you have some bloated tool to solve a problem, then even if it’s not strictly speaking an issue because you can afford the resources, it might feel ugly because it’s excessive and puts mild constraints on your other underconstrained decisions, and so on. Meanwhile a simple, minimal solution often doesn’t have this.
Natural impact regularization wouldn’t guarantee safety, since it still allows deviations that don’t interfere with the AI’s function, but it somewhat reduces one source of danger I had been thinking about lately. I had been thinking that the instrumental incentive is to search for powerful methods of influencing the world, where “power” connotes the sort of raw power that unstoppably forces a lot of change, but really the instrumental incentive is often to search for “precise” methods of influencing the world, where one can push in a lot of information to effect narrow change. (A complication is that any one agent can only have so much bandwidth. I’ve been thinking bandwidth is probably going to become a huge area of agent foundations, and that it’s been underexplored so far. (Perhaps because everyone working in alignment sucks at managing their bandwidth?))
Maybe another word for it would be “natural inner alignment”, since in a sense the point is that capabilities inevitably select for inner alignment.
Sorry if I’m getting too rambly.
I think deceptive alignment is still reasonably likely despite evidence from LLMs.
I agree with:
LLMs are not deceptively aligned and don’t really have inner goals in the sense that is scary
LLMs memorize a bunch of stuff
the kinds of reasoning that feed into deceptive alignment do not predict LLM behavior well
Adam on transformers does not have a super strong simplicity bias
without deceptive alignment, AI risk is a lot lower
LLMs not being deceptively aligned provides nonzero evidence against deceptive alignment (by conservation of evidence)
I predict I could pass the ITT for why LLMs are evidence that deceptive alignment is not likely.
However, I also note the following: LLMs are kind of bad at generalizing, and this makes them pretty bad at doing e.g. novel research or long-horizon tasks. Deceptive alignment conditions on models already being better at generalization and reasoning than current models.
My current hypothesis is that future models which generalize in a way closer to that predicted by mesaoptimization will also be better described as having a simplicity bias.
I think this and other potential hypotheses can be tested empirically today, rather than only being distinguishable close to AGI.
Note that “LLMs are evidence against this hypothesis” isn’t my main point here. The main claim is that the positive arguments for deceptive alignment are flimsy, and thus the prior is very low.
How would you imagine doing this? I understand your hypothesis to be “if a model generalises as if it’s a mesa-optimiser, then it’s better described as having a simplicity bias”. Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL/inference-time planning and search/MCTS), and then trying to see if they tend to learn simple cross-episode inner goals, which would be implied by a stronger simplicity bias?
I find myself unsure which conclusion this is trying to argue for.
Here are some pretty different conclusions:
Deceptive alignment is <<1% likely (quite implausible) to be a problem prior to complete human obsolescence (maybe it’s a problem after human obsolescence for our trusted AI successors, but who cares).
There aren’t any solid arguments for deceptive alignment[1]. So, we certainly shouldn’t be confident in deceptive alignment (e.g. >90%), though we can’t totally rule it out (prior to human obsolescence). Perhaps deceptive alignment is 15% likely to be a serious problem overall, and maybe 10% likely to be a serious problem if we condition on fully obsoleting humanity via just scaling up LLM agents or similar (this is pretty close to what I think overall).
Deceptive alignment is <<1% likely for scaled up LLM agents (prior to human obsolescence). Who knows about other architectures.
There is a big difference between <<1% likely and 10% likely. I basically agree with “not much reason to expect deceptive alignment even in models which are behaviorally capable of implementing deceptive alignment”, but I don’t think this leaves me in a <<1% likely epistemic state.
Other than noting that it could be behaviorally consistent for powerful models: powerful models are capable of deceptive alignment.
Closest to the third, but I’d put it somewhere between .1% and 5%. I think 15% is way too high for some loose speculation about inductive biases, relative to the specificity of the predictions themselves.
There are some subskills to having consistent goals that I think will be selected for, at least once outcome-based RL starts working to get models to do long-horizon tasks. For example, the ability to not be distracted/nerdsniped into some different behavior by most stimuli while doing a task. The longer the horizon, the more selection: if you have to do a 10,000-step coding project, then the probability that you get irrecoverably distracted on any one step has to be below about 1/10,000.
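A quick numerical check of that arithmetic (mine, not the commenter’s), treating each step as an independent chance of irrecoverable distraction:

```python
# Success probability for an n-step task when each step independently derails
# the run with probability p: (1 - p) ** n.
n = 10_000
for p in (1e-3, 1e-4, 1e-5):
    print(f"p = {p:g}: P(complete all {n} steps) = {(1 - p) ** n:.3g}")
# p = 0.001:          ~4.5e-05 (essentially never finishes)
# p = 0.0001 (= 1/n): ~0.37
# p = 0.00001:        ~0.9
```

So decent end-to-end reliability needs the per-step derailment rate at or below roughly 1/n, which is the selection pressure being pointed at.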
I expect some pretty sophisticated goal-regulation circuitry to develop as models get more capable, because humans need it, and this makes me pretty scared.
I agree that, conditional on no deceptive alignment, the most pernicious and least tractable sources of doom go away.
However, I disagree that conditional on no deceptive alignment, AI “basically does what you say.” Indeed, the majority of my P(doom) comes from the difference between “looks good to human evaluators” and “is actually what the human evaluators wanted.” Concretely, this could play out with models which manipulate their users into thinking everything is going well, or which tamper with sensors.
I think current observations don’t provide much evidence about whether these concerns will pan out: with current models and training set-ups, “looks good to evaluators” almost always coincides with “is what evaluators wanted.” I worry that we’ll only see this distinction matter once models are smart enough that they could competently deceive their overseers if they were trying (because of the same argument made here). (Forms of sycophancy where models knowingly assert false statements when they expect the user will agree are somewhat relevant, but there are also benign reasons models might do this.)
To what extent do you worry about the training methods used for ChatGPT, and why?
I think my crux is that once we remove the deceptive alignment issue, I suspect that profit forces alone will generate a large incentive to reduce the gap, primarily because I think people way overestimate the value of powerful, unaligned agents to the market and underestimate the demand for control over AI.
As someone who considers deceptive alignment a concern: fully agree. (With the caveat, of course, that it’s because I don’t expect LLMs to scale to AGI.)
I think there’s in general a lot of speaking-past-each-other in alignment, and what precisely people mean by “problem X will appear if we continue advancing/scaling” is one of them.
Like, of course a new problem won’t appear if we just keep doing the exact same thing that we’ve already been doing. Except “the exact same thing” is actually some equivalence class of approaches/architectures/training processes, but which equivalence class people mean can differ.
For example:
Person A, who’s worried about deceptive alignment, can have “scaling LLMs arbitrarily far” defined as this proven-safe equivalence class of architectures. So when they say they’re worried about capability advancement bringing in new problems, what they mean is “if we move beyond the LLM paradigm, deceptive alignment may appear”.
Person B, hearing the first one, might model them as instead defining “LLMs trained with N amount of compute” as the proven-safe architecture class, and so interpret their words as “if we keep scaling LLMs beyond N, they may suddenly develop this new problem”. Which, on B’s model of how LLMs work, may seem utterly ridiculous.
And the tricky thing is that Person A likely equates the “proven-safe” architecture class with the “doesn’t scale to AGI” architecture class – so they actually expect the major AI labs to venture outside that class, the moment they realize its limitations. Person B, conversely, might disagree, might think those classes are different, and that safely limited models can scale to AGI/interesting capabilities. (As a Person A in this situation, I think Person B’s model of cognition is confused; but that’s a different topic.)
All of which are important disconnects to watch out for.
(Uh, caveat: I think some people actually are worried about scaled-up LLMs exhibiting deceptive alignment, even without architectural changes. But I would delineate this cluster of views from the one I put myself in, and which I outline there. And, likewise, I expect that the other people I would tentatively classify as belonging to this cluster – Eliezer, Nate, John – mostly aren’t worried about just-scaling-the-LLMs to be leading to deceptive alignment.)
I think it’s important to put more effort into tracking such definitional issues, though. People end up overstating things because they round off their interlocutors’ viewpoint to their own. For instance if person C asks “is it safe to scale generative language pre-training and ChatGPT-style DPO arbitrarily far?”, when person D then rounds this off to “is it safe to make transformer-based LLMs as powerful as possible?” and explains that “no, because instrumental convergence and compression priors”, this is probably just false for the original meaning of the statement.
If this repeatedly happens to the point of generating a consensus for the false claim, then that can push the alignment community severely off track.
LLMs will soon scale beyond the available natural text data, and generation of synthetic data is some sort of change of architecture, potentially a completely different source of capabilities. So scaling LLMs without change of architecture much further is an expectation about something counterfactual. It makes sense as a matter of theory, but it’s not relevant for forecasting.
Edit 15 Dec: No longer endorsed based on scaling laws for training on repeated data.
Bold claim. Want to make any concrete predictions so that I can register my different beliefs?
I’ve now changed my mind based on:
N. Muennighoff et al. (2023), Scaling Data-Constrained Language Models
The main result is that up to 4 repetitions are about as good as unique data, and up to about 16 repetitions still give meaningful improvement. Let’s take 50T tokens as an estimate for available text data (as an anchor, the filtered and deduplicated CommonCrawl dataset RedPajama-Data-v2 has 30T tokens). Repeated 4 times, that can make good use of 1e28 FLOPs (with a dense transformer), and repeated 16 times, suboptimal but meaningful use of 2e29 FLOPs. So this is close to, but not lower than, what can be put to use within a few years. Thanks for pushing back on the original claim.
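For what it’s worth, those FLOP figures can be roughly reproduced from the usual Chinchilla rules of thumb (compute C ≈ 6·N·D for N parameters and D tokens, with compute-optimal D ≈ 20·N); a small sanity check, with the 50T-token figure taken from the comment and everything else a back-of-the-envelope assumption:

```python
# Back-of-the-envelope check of the repeated-data FLOP estimates above, using
# the Chinchilla heuristics C ≈ 6*N*D and D ≈ 20*N (so N = D/20 for a budget
# that is compute-optimal for D tokens).
def chinchilla_compute(tokens):
    n_params = tokens / 20        # D = 20*N  =>  N = D/20
    return 6 * n_params * tokens  # C = 6*N*D

unique_tokens = 50e12  # ~50T tokens of natural text (figure from the comment)
print(f"4 repetitions:  {chinchilla_compute(4 * unique_tokens):.1e} FLOPs")   # ~1.2e28
print(f"16 repetitions: {chinchilla_compute(16 * unique_tokens):.1e} FLOPs")  # ~1.9e29
```

Which lands in the same ballpark as the 1e28 and 2e29 figures quoted, with the 16-repetition case being data-inefficient rather than compute-optimal.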
Three points: how much compute is going into a training run, how much natural text data it wants, and how much data is available. For training compute, there are claims of multi-billion dollar runs being plausible and possibly planned in 2-5 years. Eyeballing various trends and GPU shipping numbers and revenues, it looks like about 3 OOMs of compute scaling is possible before industrial capacity constrains the trend and the scaling slows down. This assumes that there are no overly dramatic profits from AI (which might lead to finding ways of scaling supply chains faster than usual), and no overly dramatic lack of new capabilities with further scaling (which would slow down investment in scaling). That gives about 1e28-1e29 FLOPs at the slowdown in 4-6 years.
At 1e28 FLOPs, Chinchilla scaling asks for 200T-250T tokens. Various sparsity techniques increase effective compute, asking for even more tokens (when optimizing loss given fixed hardware compute).
Edit 15 Dec: I no longer endorse this point, based on scaling laws for training on repeated data.
On the outside, there are 20M-150M accessible books, some text from video, and 1T web pages of extremely dubious uniqueness and quality. That might give about 100T tokens, if LLMs are used to curate? There’s some discussion (incl. comments) here; this is the figure I’m most uncertain about. In practice, absent good synthetic data, I expect multimodality to fill the gap, but that’s not going to be as useful as good text for improving chatbot competence. (Possibly the issue with the original claim in the grandparent is what I meant by “soon”.)
I’m not sure how much I expect something like deceptive alignment from just scaling up LLMs. My guess would be that in some limit this gets AGI and also something deceptively aligned by default, but in reality we end up taking a shorter path to AGI, which also leads to deceptive alignment by default. However, I can think about the LLM AGI in this comment, and I don’t think it changes much.
The main reason I expect something like deceptive alignment is that we are expecting the thing to actually be really powerful, and so it actually needs to be coherent both over long sequences of actions and in new distributions it wasn’t trained on. It seems pretty unlikely that we end up in a world where we train the AI to act aligned on one distribution and then it generalizes exactly as we would like on a very different distribution. Or at least that it generalizes the things that we care about it generalizing, for example:
Don’t seek power, but do gain resources enough to do the difficult task
Don’t manipulate humans, but do tell them useful things and get them to sign off on your plans
I don’t think I agree with your counters to the specific arguments about deceptive alignment:
For Quintin’s post, I think Steve Byrnes’ comment is a good approximation of my views here
I don’t fully know how to think about the counting arguments when things aren’t crisply divided, and I do agree that the LLM AGI would likely be a big mess with no separable “world model”, “objective”, or “planning engine”. However, it really isn’t clear to me that this makes the case weaker; the AI will be doing something powerful (by assumption), and it’s not clear that it will do the powerful thing that we want.
I really strongly agree that the NN prior can be really hard to reason about. It really seems to me like this again makes the situation worse; we might have a really hard time thinking about how the big powerful AI will generalize when we ask it to do powerful stuff.
I think a lot of this comes from some intuition like “Condition on the AI being powerful enough to do the crazy stuff we’ll be asking of an AGI. If it is this capable but doesn’t generalize the task in the exact way you want then you get something that looks like deceptive alignment.”
Not being able to think about the NN prior really just seems to make things harder, because we don’t know how it will generalize things we tell it.
Conditional on the AI never doing something like: manipulating/deceiving[1] the humans such that the humans think the AI is aligned, such that the AI can later do things the humans don’t like, then I am much more optimistic about the whole situation.
The AI could be on some level not “aware” that it was deceiving the humans, a la Deep Deceptiveness.
I wish I had read this a week ago instead of just now, it would have saved a significant amount of confusion and miscommunication!
I think a lot of this probably comes back to way overestimating the complexity of human values. I think a very deeply held belief of a lot of LWers is that human values are intractably complicated and gene/societal-specific, and I think if this was the case, the argument would actually be a little concerning, as we’d have to rely on massive speed biases to punish deception.
These posts gave me good intuition for why human values are likely to be quite simple. One of them talks about how most of the complexity of our values is inaccessible to the genome, so it needs to start from far less complexity than people realize, because nearly all of it has to be learned. Some other posts from Steven Byrnes, which talk about how simple the brain is, are also relevant. A potential difference between me and Steven Byrnes is that I think the same learning-from-scratch algorithms that generate capabilities also apply to values, and thus the complexity of value is upper-bounded by the complexity of the learning-from-scratch algorithms plus genetic priors, both of which are likely very low: at the very least not billions of lines complex, and closer to thousands of lines/hundreds of bits.
But the reason this matters is because we no longer have good reason to assume that the deceptive model is so favored on priors like Evan Hubinger says here, as the complexity is likely massively lower than LWers assume.
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument?commentId=vXnLq7X6pMFLKwN2p
Putting it another way, the deceptive and aligned models have very similar complexities; the aligned model might even be outright lower complexity, but even if that fails, the desired goal has a complexity very similar to that of the undesired goal, so the relative difficulty of actual alignment compared to deceptive alignment is quite low.
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome#
https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8/p/wBHSYwqssBGCnwvHg
https://www.lesswrong.com/posts/PTkd8nazvH9HQpwP8/building-brain-inspired-agi-is-infinitely-easier-than
https://www.lesswrong.com/posts/aodPs8H9dQxpXAcwk/heritability-behaviorism-and-within-lifetime-rl
(I think you’re still playing into an incorrect frame by talking about “simplicity” or “speed biases.”)
My point here is that even conditional on the frame being correct, there are a lot of assumptions, like “value is complicated”, that I don’t buy, and a lot of these assumptions have a good chance of being false. That significantly impacts the downstream conclusions, and it matters because a lot of LWers probably either hold these beliefs or assume them tacitly in arguments that alignment is hard.
Also, for a defense of wrong models, see here:
https://www.lesswrong.com/posts/q5Gox77ReFAy5i2YQ/in-defense-of-probably-wrong-mechanistic-models
Interesting! I had thought this already was your take, based on posts like Reward is not the Optimization Target.
I do think that sufficiently sophisticated[1] RL policies trained in a predictable environment with a simple and consistent reward scheme probably will develop an internal model of the thing being rewarded, as a single salient thing, and separately that some learned models will learn to make longer-term-than-immediate predictions about the future. So as such I do expect “iterate through some likely actions, and choose one where the reward proxy is high” will at some point emerge as an available strategy for RL policies[2].
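A minimal sketch of that strategy shape, purely illustrative (the world model, reward proxy, and action set are stand-ins I made up):

```python
# "Iterate through some likely actions and choose one where the reward proxy is
# high": score each candidate action by rolling it out under an internal model
# and a learned reward proxy, then pick the argmax.
def proxy_argmax_policy(observation, candidate_actions, world_model, reward_proxy, horizon=3):
    def rollout_value(action):
        state, total = observation, 0.0
        for _ in range(horizon):
            state = world_model(state, action)  # internal prediction, not the environment
            total += reward_proxy(state)        # learned proxy, not the true reward
        return total
    return max(candidate_actions, key=rollout_value)

# Toy usage: 1-D state, actions shift it, the proxy rewards being near 10.
print(proxy_argmax_policy(
    observation=0.0,
    candidate_actions=[-1.0, 0.5, 2.0],
    world_model=lambda s, a: s + a,
    reward_proxy=lambda s: -abs(s - 10.0),
))  # -> 2.0, the candidate whose rollout the proxy scores highest
```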
My impression is that it’s an open question to what extent that available strategy is a better-performing strategy than a more sphexish pile of “if the environment looks like this, execute that behavior” heuristics, given a fixed amount of available computational power. In the limit as the system’s computational power approaches infinite and the accuracy of its predictions about future world states approaches perfection, the argmax(EU) strategy gets reinforced more strongly than any other strategy, and so that ends up being what gets chiseled into the model’s cognition. But of course in that limit “just brute-force sha256 bro” is an optimal strategy in certain situations, so the extent to which the “in the limit” behavior resembles the “in the regimes we actually care about” behavior is debatable.
And “sufficiently” is likely a pretty low bar
If I’m reading Ability to Solve Long Horizon Tasks Correlates With Wanting correctly, that post argues that you can’t get good performance on any task where the reward is distant in time from the actions unless your system is doing something like this.
It mostly sounds like “LLMs don’t scale into scary things”, not “deceptive alignment is unlikely”.
Publicly noting that I had a similar moment recently; perhaps we listened to the same podcast.
I think there are two separate questions here, with possibly (and I suspect actually) very different answers:
How likely is deceptive alignment to arise in an LLM under SGD across a large very diverse pretraining set (such as a slice of the internet)?
How likely is deceptive alignment to be boosted in an LLM under SGD fine tuning followed by RL for HHH-behavior applied to a base model trained by 1.?
I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything that humans do that affects the contents of the Internet. One of the many things that humans do is pretend to be more aligned to the wishes of an authority that has power over them than they truly are. So for a large enough LLM, SGD will create a world model for this behavior along with thousands of other human behaviors, and the LLM will (depending on the prompt) tend to activate this behavior at about the frequency and level that you find it on the Internet, as modified by cues in the particular prompt. On the Internet this is generally a mild background level for people writing while at work in Western countries, and probably a stronger one for people writing from more authoritarian countries; specific prompts will be more or less correlated with this.
For 2., the question is whether fine-tuning followed by RL will settle on this preexisting mechanism and make heavy use of it as part of the way it implements something that fits the fine-tuning set/scores well on the reward model aimed at creating a helpful, honest, and harmless assistant persona. I’m a lot less certain of the answer here, and I suspect it might depend rather strongly on the details of the training set. For example, does it evoke a “you’re at work, or in an authoritarian environment, so watch what you say and do” scenario that might boost the use of this particular behavior? The “harmless” element in HHH seems particularly concerning here: it suggests an environment in which certain things can’t be discussed, which tends to be the sort of environment that evinces this behavior more strongly in humans.
For a more detailed discussion, see the second half of this post.
Two quick thoughts (that don’t engage deeply with this nice post).
I’m worried about some cases where the goal is not consistent across situations. For example, a model that, when prompted to pursue some goal, pursues it in earnest, complete with convergent instrumental subgoals.
I think it seems pretty likely that future iterations of transformers will have bits of powerful search in them, but people who seem very worried about that search seem to think that once that search is established enough, gradient descent will cause the internals of the model to be organised mostly around that search (I imagine the search circuits “bubbling out” to be the outer structure of the learned algorithm). Probably this is all just conceptually confused, but to the extent it’s not, I’m pretty surprised by their intuition.
To me, it seems that consequentialism is just a really good algorithm to perform very well on a very wide range of tasks. Therefore, for difficult enough tasks, I expect that consequentialism is the kind of algorithm that would be found because it’s the kind of algorithm that can perform the task well.
When we start training the system we have a random initialization. Let’s make the following simplification: we have a goal somewhere in the system, and we have the consequentialist reasoning algorithm somewhere in the system. As we train the system, the consequentialist reasoning will get better and better, and the goal will get more and more aligned with the outer objective, because both of these things improve performance. However, there will come a point in training where the consequentialist reasoning algorithm is good enough to realize that it is in training. And then bad things start to happen: it will try to figure out the outer objective and optimize for it directly. SGD will incentivize this kind of behavior because it performs better than not doing it.
There really is a lot more to this kind of argument; so far I have failed to write it up. I hope the above is enough to hint at why I think it is possible that being deceptive is just better than not being deceptive in terms of performance. Becoming deceptive aligns the consequentialist reasoner with the outer objective faster than waiting for SGD to gradually correct everything; in fact it is sort of a constant boost to your performance to be deceptive, even before the system has become very good at optimizing for the true objective. A really good consequentialist reasoner could probably get zero loss immediately, as soon as it had a perfect model of the outer objective, by retargeting itself at that objective instead of waiting for SGD to update its goal to match.
I’m not sure that deception is a problem. Maybe it is not. But to me, it really seems like you don’t provide any reason that makes me confident that it won’t be a problem. It seems very strange to me to argue that this is not a thing because existing arguments are flimsy. I mean, you did not even address my argument. It’s like trying to prove that all bananas are green by showing me 3 green bananas.
TL;DR: I think you’re right that much inner alignment conversation is overly-fanciful. But ‘LLMs will do what they’re told’ and ‘deceptive alignment is unjustified’ are non sequiturs. Maybe we mean something different by ‘LLM’? Either way, I think there’s a case for ‘inner homunculi’ after all.
I have always thought that the ‘inner’ conversation is missing something. On the one hand it’s moderately-clearly identifying a type of object, which is a point in favour. Further, as you’ve said yourself, ‘reward is not the optimisation target’ is very obvious but it seems to need spelling out repeatedly (a service the ‘inner/outer’ distinction provides). On the other hand the ‘inner alignment’ conversation seems in practice to distract from the actual issue which is ‘some artefacts are (could be) doing their own planning/deliberation/optimisation’, and ‘inner’ is only properly pointing at a subset of those. (We can totally build, including accidentally, artefacts which do this ‘outside’ the weights of NN.)
You’ve indeed pointed at a few of the more fanciful parts of that discussion here[1], like steganographic gradient hacking. NB steganography per se isn’t entirely wild; we see ML systems use available bandwidth to develop unintuitive communication protocols a lot e.g. in MARL.
I can only assume you mean something very narrow and limited by ‘continuing to scale up LLMs’[2]. I think without specifying what you mean, your words are likely to be misconstrued.
With that said, I think something not far off the ‘inner consequentialist’ is entirely plausible and consistent with observations. In short, how do plan-like outputs emerge without a planning-like computation?[3] I’d say ‘undesired, covert, and consistent-across-situations inner goals’ is a weakman of deceptive alignment. Specialising to LLMs, the point is that they’ve exhibited poorly-understood somewhat-general planning capability, and that not all ways of turning that into competent AI assistants[4] result in aligned planners. Planners can be deceptive (and we have evidence of this).
I think your step from ‘there’s very little reason to expect “undesired, covert, and consistent-across-situations inner goals”...’ to ‘LLMs will continue doing what they’re told...’ is therefore a boldly overconfident non sequitur. (I recognise that the second step there is presented as ‘an alternative’, so it’s not clear whether you actually think that.)
Incidentally, I’d go further and say that, to me (but I guess you disagree?) it’s at least plausible that some ways of turning LLM planning capability into a competent AI assistant actually do result in ‘undesired, covert, and consistent-across-situations inner goals’! Sketch in thread.
FWIW I’m more concerned about highly competent planning coming about by one or another kind of scaffolding, but such a system can just as readily be deceptively aligned as a ‘vanilla’ LLM. If enough of the planning is externalised and understandable and actually monitored in practice, this might be easier to catch.
I’m apparently more sympathetic to fancy than you; in absence of mechanistic observables and in order to extrapolate/predict, we have to apply some imagination. But I agree there’s a lot of fancy; some of it even looks like hallucination getting reified by self-conditioning(!), and that effort could probably be more efficiently spent.
Like, already the default scaling work includes multimodality and native tool/API-integration and interleaved fine-tuning with self-supervised etc.
Yes, a giant lookup table of exactly the right data could do this. But, generalisably, practically and tractably...? We don’t believe NNs are GLUTs.
prompted, fine-tuned, RLHFed, or otherwise-conditioned LLM (ex hypothesi goal-pursuing), scaffolded, some yet-to-be-designed system…
I’m presently (quite badly IMO) trying to anticipate the shape of the next big step in get-things-done/autonomy.
I’ve had a hunch for a while that temporally abstract planning and prediction is key. I strongly suspect you can squeeze more consequential planning out of shortish serial depth than most people give credit for. This is informed by past RL-flavoured stuff like MuZero and its limitations, by observations of humans and animals (inc myself), and by general CS/algos thinking. Actually this is where I get on the LLM train. It seems to me that language is an ideal substrate for temporally abstract planning and prediction, and lots of language data in the wild exemplifies this. NB I don’t think GPTs or LLMs are uniquely on this trajectory, just getting a big bootstrap.
Now, if I had to make the most concrete ‘inner homunculus’ case off the cuff, I’d start in the vicinity of Good Regulator, except a more conjectury version regarding systems-predicting-planners (I am working on sharpening this). Maybe I’d point at Janus’ Simulators post. I suspect there might be something like an impossibility/intractability theorem for predicting planners of the right kind without running a planner of a similar kind. (Handwave!)
I’d observe that GPTs can predict planning-looking actions, including sometimes without CoT. (NOTE here’s where the most concrete and proximal evidence is!) This includes characters engaging in deceit. I’d invoke my loose reasoning regarding temporal abstraction to support the hypothesis that this is ‘more than mere parroting’, and maybe fish for examples quite far from obvious training settings to back this up. Interp would be super, of course! (Relatedly, some of your work on steering policies via activation editing has sparked my interest.)
I think maybe this is enough to transfer some sense of what I’m getting at? At this point, given some (patchy) theory, the evidence is supportive of (among other hypotheses) an ‘inner planning’ hypothesis (of quite indeterminate form).
Finally, one kind or another of ‘conditioning’ is hypothesised to reinforce the consequentialist component(s) ‘somehow’ (handwave again, though I’m hardly the only one guilty of handwaving about RLHF et al). I think it’s appropriate to be uncertain what form the inner planning takes, what form the conditioning can/will take, and what the eventual results of that are. Interested in evidence and theory around this area.
So, what are we talking about when we say ‘LLM’? Plain GPT? Well, they definitely don’t ‘do what they’re told’[1]. They exhibit planning-like outputs with the right prompts, typically associated with ‘simulated characters’ at some resolution or other. What about RLHFed GPTs? Well, they sometimes ‘do what they’re told’. They also exhibit planning-like outputs with the right prompts, and it’s mechanistically very unclear how they’re getting them.
unless you mean predicting the next token (I’m pretty sure you don’t mean this?), which they do quite well, though we don’t know how, nor when it’ll fail