So, there are a few different reasons, none of which I’ve formalized to my satisfaction.
I’m curious if these make sense to you.
(1) One is that the actual kinds of reasoning that an LLM can learn in its forward pass are quite limited.
As is well established, for instance, Transformers cannot multiply arbitrarily-long integers in a single forward pass. The number of additions involved in multiplying an N-digit integer increases in an unbounded way with N; thus, a Transformer with a finite number of layers cannot do it. (Example: prompt GPT-4 for the result of multiplying two 5-digit numbers, specifying not to use a calculator, and see how it does.)
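To make the scaling concrete, here is a back-of-the-envelope sketch (my own illustration with a hypothetical helper name, not anything from the literature): grade-school multiplication produces one partial product per digit of the multiplier, and summing those partial products requires a number of serial additions that grows with N, while a Transformer's layer count is a fixed constant.

```python
# Toy count (illustrative only): serial additions in grade-school
# multiplication of two N-digit numbers grow with N, while a
# Transformer's layer count, its serial-step budget, stays fixed.

def additions_for_multiply(n_digits: int) -> int:
    # Grade-school method: one partial product per digit of the
    # multiplier, so summing them takes n_digits - 1 multi-digit
    # additions, each propagating carries across ~2*n_digits slots.
    return (n_digits - 1) * (2 * n_digits)

for n in (2, 5, 10, 100):
    print(n, additions_for_multiply(n))
```

However you count the constants, the point is that no fixed depth covers all N.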
Of course in use you can teach a GPT to use a calculator—but we’re talking about operations that occur in a single forward pass, which rules out using tools. Because of this shallow serial depth, a Transformer also cannot (1) divide arbitrary integers, (2) figure out the results of physical phenomena that have multiplication / division problems embedded in them, (3) figure out the results of arbitrary programs with loops, and so on.
(Note—to be very clear NONE of this is a limitation on what kind of operations we can get a transformer to do over multiple unrollings of the forward pass. You can teach a transformer to use a calculator; or to ask a friend for help; or to use a scratchpad, or whatever. But we need to hide deception in a single forward pass, which is why I’m harping on this.)
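To make the single-pass vs. unrolled distinction vivid, here is a toy sketch (purely illustrative, assuming non-negative integers): each loop iteration stands in for one unrolling of the forward pass, with the running total playing the role of scratchpad state carried between passes. The number of iterations grows with the input, which is exactly the unbounded serial budget a single fixed-depth pass lacks.

```python
# Toy model of a scratchpad: each loop iteration stands in for one
# unrolled forward pass, and the running total is the state carried
# between passes. Assumes non-negative integers; a cartoon of
# chain-of-thought, not a model of an actual Transformer.

def multiply_with_scratchpad(a: int, b: int) -> int:
    total = 0  # scratchpad state carried across "passes"
    for position, ch in enumerate(reversed(str(b))):
        # One "pass": add a single partial product to the running total.
        total += a * int(ch) * 10 ** position
    return total

print(multiply_with_scratchpad(12345, 67890))  # -> 838102050
```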
So to think that you learn deception in a forward pass, you have to think that the transformer thinks something like “Hey, if I deceive the user into thinking that I’m a good entity, I’ll be able to later seize power, and if I seize power, then I’ll be able to (do whatever), so—considering all this, I should… predict that the next token will be ‘purple’”—and that it thinks this in a context that could NOT come up with the algorithm for multiplication, or for addition, or for any number of other things, even though an algorithm for multiplication would be much, much, MUCH more directly incentivized by SGD, because it’s directly relevant for token predictions.
(2) Another way to get at the problem with this reasoning: I think it hypothesizes an agent inside the weight updates based on the analogical resemblance to an agent that the finished product has. But in fact there’s at most a superficial resemblance between (LLM forward pass) and (repeated LLM forward passes in a chain-of-thought over text).
That is, an LLM unrolled multiple times, from a given prompt, can make plans; it can plot to seize power, imitating humans who it saw thus plot; it can multiply N-digit integers, working them out just like a human. But this tells us literally nothing about what it can do in a single forward pass.
For comparison, consider a large neural network that is used for image segmentation. The entire physical world falls into the domain of such a model. It can learn that people exist, that dogs exist, and that machinery exists, in some sense. What if such a neural network—in a single forward pass—used deceptive reasoning, which turned out to be useful for prediction because of the backward pass? Ought we therefore to expect that such a neural network—when embedded in some device down the road—would turn and kill us?
The argument is exactly identical to the case of the language model, but no one makes it. And I think the reason is that people think about the properties that a trained LLM can exhibit *when unrolled over multiple forward passes, in a particular context and with a particular prompt*, and then mistakenly attribute these properties to the single forward pass.
(All of which is to say—look, if you think you can get a deceptive agent from a LLM this way you should also expect a deceptive agent from an image segmentation model. Maybe that’s true! But I’ve never seen anyone say this, which makes me think they’re making the mistake I describe above.)
(3). I think this is just attributing extremely complex machinery to the forward pass of an LLM that is supposed to show up in a data-indifferent manner, and that this is a universally bad bet for ML.
Like, different Transformers store different things depending on the data they’re given. If you train them on SciHub they store a bunch of SciHub shit. If you train them on Wikipedia they store a bunch of Wikipedia shit. In every case, for each weight in the Transformer, you can find specific reasons for each neuron being what it is because of the data.
The “LLM will learn deception” hypothesis amounts to saying that—so long as a LLM is big enough, and trained on enough data to know the world exists—you’ll find complex machinery in it that (1) specifically activates once it figures out that it’s “not in training” and (2) was mostly just hiding until then. My bet is that this won’t show up, because there are no such structures in a Transformer that don’t depend on data. Your French Transformer / English Transformer / Toolformer / etc will not all learn to betray you if they get big enough—we will not find unused complex machinery in a Transformer to betray you because we find NO unused complex machinery in a transformer, etc.
I think an actually well-put together argument will talk about frequency bias and shit, but this is all I feel like typing for now.
Does this make sense? I’m still working on putting it together.
You’ve given me a lot to think about, thanks! Here are my thoughts as I read:
(1) One is that the actual kinds of reasoning that an LLM can learn in its forward pass are quite limited.
As is well established, for instance, Transformers cannot multiply arbitrarily-long integers in a single forward pass. The number of additions involved in multiplying an N-digit integer increases in an unbounded way with N; thus, a Transformer with a finite number of layers cannot do it. (Example: prompt GPT-4 for the result of multiplying two 5-digit numbers, specifying not to use a calculator, and see how it does.)
Of course in use you can teach a GPT to use a calculator—but we’re talking about operations that occur in a single forward pass, which rules out using tools. Because of this shallow serial depth, a Transformer also cannot (1) divide arbitrary integers, (2) figure out the results of physical phenomena that have multiplication / division problems embedded in them, (3) figure out the results of arbitrary programs with loops, and so on.
...
So to think that you learn deception in a forward pass, you have to think that the transformer thinks something like “Hey, if I deceive the user into thinking that I’m a good entity, I’ll be able to later seize power, and if I seize power, then I’ll be able to (do whatever), so—considering all this, I should… predict that the next token will be ‘purple’”—and that it thinks this in a context that could NOT come up with the algorithm for multiplication, or for addition, or for any number of other things, even though an algorithm for multiplication would be much, much, MUCH more directly incentivized by SGD, because it’s directly relevant for token predictions.
Yes, and the sort of deceptive reasoning I’m worried about sure seems pretty simple, with very little serial depth to it, unlike multiplying two 5-digit integers. The example you give, for instance, involves something like 6 steps. I’m pretty sure GPT-4 already does ‘reasoning’ of about that level of sophistication in a single forward pass, e.g. when predicting the next token of the transcript of a conversation in which one human is deceiving another about something. (In fact, in general, how do you explain how LLMs can predict deceptive text, if they simply don’t have enough layers to do all the deceptive reasoning without ‘spilling the beans’ into the token stream?)
(2) Another way to get at the problem with this reasoning: I think it hypothesizes an agent inside the weight updates based on the analogical resemblance to an agent that the finished product has. But in fact there’s at most a superficial resemblance between (LLM forward pass) and (repeated LLM forward passes in a chain-of-thought over text).
That is, an LLM unrolled multiple times, from a given prompt, can make plans; it can plot to seize power, imitating humans who it saw thus plot; it can multiply N-digit integers, working them out just like a human. But this tells us literally nothing about what it can do in a single forward pass.
For comparison, consider a large neural network that is used for image segmentation. The entire physical world falls into the domain of such a model. It can learn that people exist, that dogs exist, and that machinery exists, in some sense. What if such a neural network—in a single forward pass—used deceptive reasoning, which turned out to be useful for prediction because of the backward pass? Ought we therefore to expect that such a neural network—when embedded in some device down the road—would turn and kill us?
The argument is exactly identical to the case of the language model, but no one makes it. And I think the reason is that people think about the properties that a trained LLM can exhibit *when unrolled over multiple forward passes, in a particular context and with a particular prompt*, and then mistakenly attribute these properties to the single forward pass.
(All of which is to say—look, if you think you can get a deceptive agent from a LLM this way you should also expect a deceptive agent from an image segmentation model. Maybe that’s true! But I’ve never seen anyone say this, which makes me think they’re making the mistake I describe above.)
The reason I’m not worried about image segmentation models is that it doesn’t seem like they’d have the relevant capabilities or goals. Maybe in the limit they would—if we somehow banned all other kinds of AI, but let image segmentation models scale to arbitrary size and training-data amounts—then eventually, after decades of scaling and adding 9s of reliability to their image predictions, they’d end up with scary agents inside, because that would be useful for getting one of those 9s. But yeah, it’s a pretty good bet that the relevant kinds of capabilities (e.g. ability to coherently pursue goals, ability to write code, ability to persuade humans of stuff) are most likely to appear earliest in systems that are trained in environments more tailored to developing those capabilities. tl;dr my answer is “wake me when an image segmentation model starts performing well in dangerous-capabilities evals like METR’s and OpenAI’s. Which won’t happen for a long time, because image segmentation models are going to be worse at agency than models explicitly trained to be agents.”
(3). I think this is just attributing extremely complex machinery to the forward pass of an LLM that is supposed to show up in a data-indifferent manner, and that this is a universally bad bet for ML.
Like, different Transformers store different things depending on the data they’re given. If you train them on SciHub they store a bunch of SciHub shit. If you train them on Wikipedia they store a bunch of Wikipedia shit. In every case, for each weight in the Transformer, you can find specific reasons for each neuron being what it is because of the data.
The “LLM will learn deception” hypothesis amounts to saying that—so long as a LLM is big enough, and trained on enough data to know the world exists—you’ll find complex machinery in it that (1) specifically activates once it figures out that it’s “not in training” and (2) was mostly just hiding until then. My bet is that this won’t show up, because there are no such structures in a Transformer that don’t depend on data. Your French Transformer / English Transformer / Toolformer / etc will not all learn to betray you if they get big enough—we will not find unused complex machinery in a Transformer to betray you because we find NO unused complex machinery in a transformer, etc.
I think this is a misunderstanding of the “LLM will learn deception” hypothesis. First of all, the conditions of the hypothesis are not just “so long as it’s big enough and knows the world exists.” They’re more stringent than that: there probably needs to be agency (goal-directedness), for example, and situational awareness. (Though I think John Wentworth disagrees?)
Secondly, the “complex machinery” claim is actually trivial, though you make it sound like it’s crazy. ANY behavior of a neural net in situation class X, which does not appear in situation class Y, is the result of ‘unused-in-Y complex machinery.’ So set Y = training and X = deployment, and literally any claim about how deployment will be different from training involves this.
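A throwaway sketch of that triviality claim (the names are hypothetical; this is a cartoon of the argument, not a model of an LLM): any function whose output depends on a which-situation-am-I-in feature contains, by definition, machinery that sits unused in the other situation class.

```python
# Toy illustration: ANY behavioral difference between situation
# classes implies machinery that goes unused in one of them.
# Hypothetical names throughout; a cartoon, not a model of an LLM.

def toy_policy(looks_like_training: bool) -> str:
    if looks_like_training:
        return "behave-as-if-aligned"
    # From the point of view of every training-time input, this branch
    # is "unused complex machinery", yet nothing exotic was needed to
    # create it: just set Y = training, X = deployment.
    return "pursue-goal-directly"

print(toy_policy(True))   # -> behave-as-if-aligned
print(toy_policy(False))  # -> pursue-goal-directly
```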
Another different approach I could take would be: The complex machinery DOES get used a lot in training. Indeed that’s why it evolved/was-formed-by-SGD. The complex machinery is the goal-directedness machinery, the machinery that chooses actions on the basis of how well the action is predicted to serve the goals. That machinery is presumably used all the fucking time in training, and it causes the system to behave-as-if-aligned in training and behave in blatantly unaligned ways once it’s very obvious that it can get away with doing so.
These all seem like reasonable reasons to doubt the hypothesized mechanism, yup. I think you’re underestimating how much can happen in a single forward pass, though—it has to be somewhat shallow, so it can’t involve too many sequential steps, but the whole point of making the networks as large as we do these days is that it turns out an awful lot can happen in parallel. I also think there would be no reason for deception to occur if it’s never a good weight pattern for predicting the data; it’s only if the data contains a pattern that the gradient will put into a deceptive forward mechanism that this could possibly occur. For example, if the model is trained on a bunch of humans being deceptive about their political intentions, and then RLHF is attempted.
In any case, I don’t think the old Yudkowsky model of deceptive alignment is relevant, in that I think the level of deception to expect from AI should be calibrated to be around the amount you’d expect from a young human, not some super-schemer god. The concern arises only when the data actually contains patterns well modeled by deception, and this would be expected to be more present in the case of something like an engagement-maximizing online-learning RL system.
And to be clear, I don’t expect the things that can destroy humanity to arise because of deception directly. It seems much more likely to me that they’ll arise because competing people ask their models to do things that put those models in competition in a way that puts humanity at risk: e.g. several different powerful competing model-based engagement/sales-optimizing reinforcement learners, or, more speculatively, something military. Something where the core problem is that alignment tech is effectively not used, and where solving this deception problem wouldn’t have saved us anyway.
Regarding the details of your descriptions: I mainly think this sort of deception would arise in the wild when there’s a reward model passing gradients to multiple steps of a sequential model, or possibly via the imitating-humans-locally thing. Without a reward model, nothing pushes the different steps of the sequential model toward trying to achieve the “same thing” across different steps in any significant sense. But of course almost all the really useful models involve a reward model somehow.