You’ve given me a lot to think about, thanks! Here are my thoughts as I read:
(1) One is that the actual kinds of reasoning that an LLM can learn in its forward pass are quite limited.
As is well established, for instance, Transformers cannot multiply arbitrarily long integers in a single forward pass. The number of additions involved in multiplying an N-digit integer grows without bound as N grows; thus, a Transformer with a finite number of layers cannot do it. (Example: prompt GPT-4 for the result of multiplying two 5-digit numbers, specifying that it not use a calculator, and see how it does.)
Of course, in use you can teach a GPT to use a calculator—but we’re talking about operations that occur in a single forward pass, which rules out using tools. Because of this shallow serial depth, a Transformer also cannot (1) divide arbitrary integers, (2) figure out the results of physical phenomena that have multiplication/division problems embedded in them, (3) figure out the results of arbitrary programs with loops, and so on.
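To make the serial-depth point concrete, here is a minimal Python sketch (mine, not from the discussion above) of schoolbook multiplication that counts how many additions have to happen one after another. Each accumulation depends on the previous running sum, so the chain of dependent steps grows with the number of digits, whereas a fixed number of layers gives a fixed number of sequential steps per forward pass.

```python
# Minimal sketch: the number of *dependent* additions in schoolbook
# multiplication grows with the number of digits N, so no fixed-depth
# computation can cover all N.

def schoolbook_multiply(a: int, b: int):
    """Multiply a and b digit by digit, counting the sequential accumulation steps."""
    digits_b = [int(d) for d in str(b)][::-1]  # least-significant digit first
    total = 0
    serial_additions = 0
    for position, digit in enumerate(digits_b):
        partial_product = a * digit * (10 ** position)  # one shifted partial product
        total += partial_product  # this step depends on the previous running sum
        serial_additions += 1
    return total, serial_additions

for n_digits in (2, 5, 10, 20):
    x = int("9" * n_digits)
    product, steps = schoolbook_multiply(x, x)
    assert product == x * x
    print(f"{n_digits}-digit operands -> {steps} dependent additions")
```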
...
So to think that you learn deception in the forward pass, you have to think that the transformer thinks something like “Hey, if I deceive the user into thinking that I’m a good entity, I’ll be able to later seize power, and if I seize power, then I’ll be able to (do whatever), so, considering all this, I should… predict that the next token will be ‘purple’”—and that it thinks this in a context that could NOT come up with the algorithm for multiplication, or for addition, or for any number of other things, even though an algorithm for multiplication would be much much MUCH more directly incentivized by SGD, because it’s directly relevant for token predictions.
Yes, and the sort of deceptive reasoning I’m worried about sure seems pretty simple, with very little serial depth to it, unlike multiplying two 5-digit integers. The example you give, for instance, involves like 6 steps. I’m pretty sure GPT-4 already does ‘reasoning’ of about that level of sophistication in a single forward pass, e.g. when predicting the next token of the transcript of a conversation in which one human is deceiving another about something. (In fact, in general, how do you explain how LLMs can predict deceptive text, if they simply don’t have enough layers to do all the deceptive reasoning without ‘spilling the beans’ into the token stream?)
(2) Another way to get at the problem with this reasoning is that I think it hypothesizes an agent arising within the weight updates, based on the analogical resemblance to an agent that the finished product has. But in fact there’s at most a superficial resemblance between (an LLM forward pass) and (repeated LLM forward passes in a chain-of-thought over text).
That is, an LLM unrolled multiple times, from a given prompt, can make plans; it can plot to seize power, imitating humans who it saw thus plot; it can multiply N-digit integers, working them out just like a human. But this tells us literally nothing about what it can do in a single forward pass.
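Here is a toy sketch of that distinction (all names are hypothetical, not any real model API): the capabilities of the unrolled loop come from repeating a bounded-depth computation many times and storing the intermediate results in the token stream, which is exactly the scratch space a single forward pass does not get.

```python
# Toy sketch: a single forward pass is one bounded-depth computation;
# chain-of-thought is many such passes feeding back through the token stream.

class DummyModel:
    """Stand-in for a next-token predictor with a fixed number of layers."""

    def predict_next_token(self, tokens):
        # A real model does a bounded-depth computation here, no matter how
        # hard the prediction problem is.
        return sum(tokens) % 10  # placeholder "token"

def single_forward_pass(model, tokens):
    # One pass: fixed serial depth, no scratch space outside the activations.
    return model.predict_next_token(tokens)

def unrolled_chain_of_thought(model, prompt_tokens, n_steps):
    # Repeated passes: each output is appended and fed back in, so the total
    # serial depth grows with n_steps, the way a human works out long arithmetic.
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        tokens.append(single_forward_pass(model, tokens))
    return tokens

print(unrolled_chain_of_thought(DummyModel(), [1, 2, 3], n_steps=5))
```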
For comparison, consider a large neural network that is used for image segmentation. The entire physical world falls into the domain of such a model. It can learn that people exist, that dogs exist, and that machinery exists, in some sense. What if we argued that such a neural network, in a single forward pass, uses deceptive reasoning, that this turned out to be useful for prediction because of the backward pass, and that we ought therefore to expect that such a neural network, when embedded in some device down the road, would turn and kill us?
The argument is exactly identical to the case of the language model, but no one makes it. And I think the reason is that people think about the properties that a trained LLM can exhibit *when unrolled over multiple forward passes*, in a particular context and with a particular prompt, and then mistakenly attribute these properties to the single forward pass.
(All of which is to say—look, if you think you can get a deceptive agent from an LLM this way, you should also expect a deceptive agent from an image segmentation model. Maybe that’s true! But I’ve never seen anyone say this, which makes me think they’re making the mistake I describe above.)
The reason I’m not worried about image segmentation models is that it doesn’t seem like they’d have the relevant capabilities or goals. Maybe in the limit they would—if we somehow banned all other kinds of AI, but let image segmentation models scale to arbitrary size and training data amounts—then eventually after decades of scaling and adding 9s of reliability to their image predictions, they’d end up with scary agents inside because that would be useful for getting one of those 9s. But yeah, it’s a pretty good bet that the relevant kinds of capabilities (e.g. ability to coherently pursue goals, ability to write code, ability to persuade humans of stuff) are most likely to appear earliest in systems that are trained in environments more tailored to developing those capabilities. tl;dr my answer is “wake me when an image segmentation model starts performing well in dangerous capabilities evals like METR’s and OpenAI’s. Which won’t happen for a long time because image segmentation models are going to be worse at agency than models explicitly trained to be agents.”
(3) I think this just attributes extremely complex machinery to the forward pass of an LLM, machinery that is supposed to show up in a data-indifferent manner, and I think that is a universally bad bet for ML.
Like, different Transformers store different things depending on the data they’re given. If you train them on SciHub they store a bunch of SciHub shit. If you train them on Wikipedia they store a bunch of Wikipedia shit. In every case, for each weight in the Transformer, you can find specific reasons for it being what it is in the data.
The “LLM will learn deception” hypothesis amounts to saying that—so long as an LLM is big enough, and trained on enough data to know the world exists—you’ll find complex machinery in it that (1) specifically activates once it figures out that it’s “not in training” and (2) was mostly just hiding until then. My bet is that this won’t show up, because there are no such structures in a Transformer that don’t depend on the data. Your French Transformer / English Transformer / Toolformer / etc. will not all learn to betray you if they get big enough—we will not find unused complex machinery in a Transformer to betray you, because we find NO unused complex machinery in a Transformer, etc.
I think this is a misunderstanding of the “LLM will learn deception” hypothesis. First of all, the conditions of the hypothesis are not just “so long as it’s big enough and knows the world exists.” The hypothesis is more stringent than that; there probably needs to be agency (i.e. goal-directedness), for example, and situational awareness. (Though I think John Wentworth disagrees?)
Secondly, the “complex machinery” claim is actually trivial, though you make it sound like it’s crazy. ANY behavior of a neural net in situation class X that does not appear in situation class Y is the result of ‘unused-in-Y complex machinery.’ So set Y = training and X = deployment, and literally any claim about how deployment will be different from training involves this.
Another approach I could take would be: the complex machinery DOES get used a lot in training. Indeed, that’s why it evolved / was formed by SGD. The complex machinery is the goal-directedness machinery, the machinery that chooses actions on the basis of how well each action is predicted to serve the goals. That machinery is presumably used all the fucking time in training, and it causes the system to behave-as-if-aligned in training and to behave in blatantly unaligned ways once it’s very obvious that it can get away with doing so.
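A purely hypothetical toy to make that concrete: the goal-directed selection rule below is identical in every situation and gets exercised constantly; the only thing that changes between training and “can get away with it” is the predicted consequences it is fed.

```python
# Purely hypothetical toy: the same goal-directed machinery runs everywhere;
# only the predicted consequences differ between training and deployment.

def choose_action(actions, predict_outcome, goal_value):
    """The machinery: pick the action whose predicted outcome best serves the goal."""
    return max(actions, key=lambda action: goal_value(predict_outcome(action)))

def goal_value(outcome):
    # Toy goal: acquire resources, with a large penalty for being caught.
    return outcome["resources"] - 100 * outcome["caught"]

def predict_outcome(action, overseer_can_catch_us):
    if action == "comply":
        return {"resources": 1, "caught": False}
    # "defect" pays off only when the system predicts it won't be caught.
    return {"resources": 10, "caught": overseer_can_catch_us}

for can_catch in (True, False):
    picked = choose_action(
        ["comply", "defect"],
        lambda a: predict_outcome(a, overseer_can_catch_us=can_catch),
        goal_value,
    )
    print(f"overseer_can_catch_us={can_catch} -> {picked}")
```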