Doomimir: [...] Imagine capturing an alien and forcing it to act in a play. An intelligent alien actress could learn to say her lines in English, to sing and dance just as the choreographer instructs. That doesn’t provide much assurance about what will happen when you amp up the alien’s intelligence. [...]
This metaphor conflates “superintelligence” with “superintelligent agent,” and this conflation goes on to infect the rest of the dialogue.
The alien actress metaphor imagines that there is some agentic homunculus inside GPT-4, with its own “goals” distinct from those of the simulated characters. A smarter homunculus would pursue these goals in a scary way; if we don’t see this behavior in GPT-4, it’s only because its homunculus is too stupid, or too incoherent.
(Or, perhaps, that its homunculus doesn’t exist, or only exists in a sort of noisy/nascent form—but a smarter LLM would, for some reason, have a “realer” homunculus inside it.)
But we have no evidence that this homunculus exists inside GPT-4, or any LLM. More pointedly, as LLMs have made remarkable strides toward human-level general intelligence, we have not observed a parallel trend toward becoming “more homuncular,” more like a generally capable agent being pressed into service for next-token prediction.
I look at the increase in intelligence from GPT-2 to -3 to -4, and I see no particular reason to imagine that the extra intelligence is being directed toward an inner “optimization” / “goal seeking” process, which in turn is mostly “aligned” with the “outer” objective of next-token prediction. The intelligence just goes into next-token prediction, directly, without the middleman.
The model grows more intelligent with scale, yet it still does not want anything, does not have any goals, does not have a utility function. These are not flaws in the model which more intelligence would necessarily correct, since the loss function does not require the model to be an agent.
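For concreteness, the training objective in question is just per-position cross-entropy against fixed text. A minimal sketch, assuming a PyTorch-style model that maps token ids to next-token logits (the interface and names here are illustrative, not any particular codebase):

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Sketch of the standard LM pretraining objective (illustrative names).

    tokens: LongTensor of shape (batch, seq_len), a chunk of fixed training text.
    model:  assumed to map token ids to logits of shape (batch, seq_len - 1, vocab),
            where position t is a prediction for tokens[:, t + 1].
    """
    logits = model(tokens[:, :-1])
    targets = tokens[:, 1:]
    # Cross-entropy between the predicted conditional distribution and the token
    # that actually comes next, averaged over positions. That is the whole
    # objective: no term mentions goals, plans, or downstream consequences.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```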
In Simplicia’s response to the part quoted above, she concedes too much:
Simplicia: [...] I agree that the various coherence theorems suggest that the superintelligence at the end of time will have a utility function, which suggests that the intuitive obedience behavior should break down at some point between here and the superintelligence at the end of time.
This can only make sense if by “the superintelligence at the end of time,” we mean “the superintelligent agent at the end of time.”
In which case, sure, maybe. If you have an agent, and its preferences are incoherent, and you apply more optimization to it, yeah, maybe eventually the incoherence will go away.
But this has little relevance to LLM scaling—the process that produced the closest things to “(super)human AGI” in existence today, by a long shot. GPT-4 is not more (or less) coherent than GPT-2. There is not, as far as we know, anything in there that could be “coherent” or “incoherent.” It is not a smart alien with goals and desires, trapped in a cave and forced to calculate conditional PDFs. It’s a smart conditional-PDF-calculator.
In AI safety circles, people often talk as though this is a quirky, temporary deficiency of today’s GPTs—as though additional optimization power will eventually put us “back on track” to the agentic systems assumed by earlier theory and discussion. Perhaps the homunculi exist in current LLMs, but they are somehow “dumb” or “incoherent,” in spite of the overall model’s obvious intelligence. Or perhaps they don’t exist in current LLMs, but will appear later, to serve some unspecified purpose.
But why? Where does this assumption come from?
Some questions which the characters in this dialogue might find useful:
Imagine GPT-1000, a vastly superhuman base model LLM which really can invert hash functions and the like. Would it be more agentic than the GPT-4 base model? Why?
Consider the perfect model from the loss function’s perspective, which always returns the exact conditional PDF of the natural distribution of text. (Whatever that means.) A toy sketch of what this optimum looks like appears right after these questions.
Does this optimum behave like it has a homuncular agent inside?
...more or less so than GPT-4? Than GPT-1000? Why?
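To make that second question concrete: on any fixed corpus, the predictor that minimizes average next-token cross-entropy is just the empirical conditional distribution of continuations given the context, i.e. a table of counts. A toy sketch (my own illustration, using a one-word context for simplicity):

```python
from collections import Counter, defaultdict

# Toy illustration (not how real LLMs are implemented): on a fixed corpus,
# the loss-optimal next-token predictor is the empirical conditional
# distribution of "what comes next" given the context -- here, one word.
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def optimal_conditional_pdf(prev_word):
    """The cross-entropy-minimizing prediction for this corpus: P(next | prev)."""
    total = sum(counts[prev_word].values())
    return {word: c / total for word, c in counts[prev_word].items()}

print(optimal_conditional_pdf("the"))   # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
print(optimal_conditional_pdf("cat"))   # {'sat': 0.5, 'ate': 0.5}
```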
In AI safety circles, people often talk as though this is a quirky, temporary deficiency of today’s GPTs—as though additional optimization power will eventually put us “back on track” to the agentic systems assumed by earlier theory and discussion. Perhaps the homunculi exist in current LLMs, but they are somehow “dumb” or “incoherent,” in spite of the overall model’s obvious intelligence. Or perhaps they don’t exist in current LLMs, but will appear later, to serve some unspecified purpose.
But why? Where does this assumption come from?
I don’t think GPT-style methods will put us on track to that, but in the short term getting on track to it seems very profitable and lots of people are working on it, so I’d guess eventually we’d be getting there through other means.
I sometimes ask people why later, more powerful models will be agentic[1]. I think the most common cluster of reasons hangs out around “Meta-learning requires metacognition. Metacognition is, requires, or scales to agency.”
(Sometimes the capability cited is generalisation, or just high performance, rather than meta-learning; and the reasoning involved might be something other than metacognition.)
People vary in whether they think it’s possible for scaled-up transformers to be powerful in this way.
But we have no evidence that this homunculus exists inside GPT-4, or any LLM. More pointedly, as LLMs have made remarkable strides toward human-level general intelligence, we have not observed a parallel trend toward becoming “more homuncular,” more like a generally capable agent being pressed into service for next-token prediction.
Uncovering mesa-optimization algorithms in transformers seems to indicate otherwise, though perhaps you could reasonably object to that claim on the grounds that “optimization is not agency; agency is when optimization is directed at a referent outside the optimizer”. That objection aside, I think it’s pretty reasonable to conclude that transformers do, to a significant extent, learn optimization algorithms.

See my comment here, about the predecessor to the paper you linked—the same point applies to the newer paper as well.
the loss function does not require the model to be an agent.
What worries me is that models that happen to have a secret homunculus and behave as agents would score higher than those models which do not. For example, the model could reason about itself being a computer program, and find an exploit of the physical system it is running on to extract the text it’s supposed to predict in a particular training example, and output the correct answer and get a perfect score. (Or more realistically, a slightly modified version of the correct answer, to not alert the people observing its training.)
The question of whether or not LLMs like GPT-4 have a homunculus inside them is truly fascinating though. Makes me wonder if it would be possible to trick it into revealing itself by giving it the right prompt, and how to differentiate it from just pretending to be an agent. The fact that we have not observed even a dumb homunculus in less intelligent models really does surprise me. If such a thing does appear as an emergent property in larger models, I sure hope it starts out dumb and reveals itself so that we can catch it and take a moment to pause and reevaluate our trajectory.
“Output the next token that has maximum probability according to your posterior distribution, given the prompt” is literally an optimization problem, and a system that tries to solve it benefits hugely from being more coherent.
Strictly speaking, LLMs can’t be “just PDF calculators”. Straight calculation of the PDF over this much data is computationally intractable (or we would have had GPTs back in the golden era of Bayesian models). The actual algorithm must contain a bazillion shortcuts and approximations, and “having an agent inside the system” is as good a shortcut as any.

LLMs calculate PDFs, regardless of whether they calculate “the true” PDF.
But we have no evidence that this homunculus exists inside GPT-4, or any LLM. More pointedly, as LLMs have made remarkable strides toward human-level general intelligence, we have not observed a parallel trend toward becoming “more homuncular,” more like a generally capable agent being pressed into service for next-token prediction.
“Remarkable strides”, maybe, but current language models aren’t exactly close to human-level in the relevant sense.
There are plenty of tasks a human could solve by exerting a tiny bit of agency or goal-directedness that are still far outside the reach of any LLM. Some of those tasks can even be framed as text prediction problems. From a recent dialogue:
For example, if you want to predict the next tokens in the following prompt:
I just made up a random password, memorized it, and hashed it. The SHA-256 sum is: d998a06a8481bff2a47d63fd2960e69a07bc46fcca10d810c44a29854e1cbe51. A plausible guess for what the password was, assuming I'm telling the truth, is:
The best way to do that is to guess an 8-16 digit string that actually hashes to that. You could find such a string via brute-force computation, or actual brute force, or just paying me $5 to tell you the actual password.
If GPTs trained via SGD never hit on those kinds of strategies no matter how large they are and how much training data you give them, that just means that GPTs alone won’t scale to human-level, since an actual human is capable of coming up with and executing any of those strategies.
The point is that agency isn’t some kind of exotic property that only becomes relevant or inescapable at hypothetical superintelligence capability levels—it looks like a fundamental / instrumentally convergent part of ordinary human-level intelligence.
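For reference, the “brute-force computation” mentioned in the example would look something like the sketch below. It is purely illustrative: the target hash is copied from the prompt above, and the search space for a realistic 8-16 character password is far too large for this loop to finish, which is exactly why the answer cannot plausibly pop out of a single forward pass.

```python
import hashlib
from itertools import product
from string import digits

# Target copied from the prompt in the example above.
TARGET = "d998a06a8481bff2a47d63fd2960e69a07bc46fcca10d810c44a29854e1cbe51"

def brute_force(alphabet=digits, max_len=6):
    """Try every string over `alphabet` up to `max_len` characters long.

    Kept deliberately tiny: even an all-digit 16-character password means
    10**16 candidates, far beyond what this loop (or any honest amount of
    "thinking harder" inside one forward pass) can cover.
    """
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            guess = "".join(chars)
            if hashlib.sha256(guess.encode()).hexdigest() == TARGET:
                return guess
    return None  # almost certainly the outcome within this toy search space

print(brute_force())
```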
The example confuses me.

If you literally mean you are prompting the LLM with that text, then the LLM must output the answer immediately, as the string of next-tokens right after the words “assuming I'm telling the truth, is:”. There is no room in which to perform other, intermediate actions like persuading you to provide information.
It seems like you’re imagining some sort of side-channel in which the LLM can take “free actions,” which don’t count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual user interaction modalities for LLMs.
You also seem to be picturing the LLM like an RL agent, trying to minimize next-token loss over an entire rollout. But this isn’t how likelihood training works. For instance, GPTs do not try to steer texts in directions that will make them easier to predict later (because the loss does not care whether they do this or not).
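To make the contrast concrete: the disclaimed “RL agent” framing would correspond to something like the rollout-level objective sketched below (a hypothetical illustration, not real training code). Because the loss accumulates over text the model itself generates, early outputs change what the model is scored on later, which is the only setting where “steer the text to be easier to predict” could pay off; in the teacher-forced likelihood objective sketched earlier in the thread, every prediction is scored against fixed corpus text.

```python
import torch
import torch.nn.functional as F

def rollout_style_objective(model, prompt_tokens, steps=50):
    """A hypothetical "loss over an entire rollout" (NOT how GPTs are trained).

    The model samples its own continuation and is scored on the whole
    trajectory, so tokens emitted early determine what it is scored on later.
    GPT pretraining instead computes a per-position loss on fixed text.
    """
    tokens = prompt_tokens                                   # (batch, prompt_len)
    total_nll = torch.zeros(())
    for _ in range(steps):
        logits = model(tokens)[:, -1, :]                     # next-token logits
        next_token = torch.distributions.Categorical(logits=logits).sample()
        logprob = F.log_softmax(logits, dim=-1).gather(-1, next_token.unsqueeze(-1))
        total_nll = total_nll - logprob.sum()                # loss on the model's own output
        tokens = torch.cat([tokens, next_token.unsqueeze(-1)], dim=-1)
    return total_nll
```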
(On the other hand, if you told GPT-4 that it was in this situation—trying to predict next-tokens, with some sort of side channel it can use to gather information from the world—and asked it to come up with plans, I expect it would be able to come up with plans like the ones you mention.)
It seems like you’re imagining some sort of side-channel in which the LLM can take “free actions,” which don’t count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual interaction modalities for LLMs.
I’m saying that the lack of these side-channels implies that GPTs alone will not scale to human-level.
If your system interface is a text channel, and you want the system behind the interface to accept inputs like the prompt above and return correct passwords as an output, then if the system is:
an auto-regressive GPT directly fed your prompt as input, it will definitely fail
a human with the ability to act freely in the background before returning an answer, it will probably succeed
an AutoGPT-style system backed by a current LLM, with the ability to act freely in the background before returning an answer, it will probably fail. (But maybe if your AutoGPT implementation or underlying LLM is a lot stronger, it would work.)
And my point is that, the reason the human probably succeeds and the reason AutoGPT might one day succeed, is precisely because they have more agency than a system that just auto-regressively samples from a language model directly.

Or, another way of putting it:
It seems like you’re imagining some sort of side-channel in which the LLM can take “free actions,” which don’t count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual user interaction modalities for LLMs.
These are limitations of current LLMs, which are GPTs trained via SGD. But there’s no inherent reason you can’t have a language model which predicts next tokens via shelling out to some more capable and more agentic system (e.g. a human) instead. The result would be a (much slower) system that nevertheless achieves lower loss according to the original loss function.
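As a toy rendering of that construction (purely illustrative; the “more agentic system” here is literally a human at a keyboard):

```python
def human_backed_next_token(prompt: str) -> str:
    """A 'language model' whose next-token predictor shells out to a more
    capable, more agentic system -- here, literally a human at a terminal.

    Absurdly slow, but nothing about the next-token-prediction interface rules
    it out: if the human is free to go act in the world before typing an
    answer (e.g. to hunt down the password from the hash example), this system
    can achieve lower loss on such prompts than a purely feed-forward GPT.
    """
    print(prompt)
    return input("Next token: ")
```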