Not to be a scaling-law denier. I believe in them, I do! But they measure perplexity, not general intelligence/real-world usefulness, and Goodhart’s Law is no-one’s ally.
If we’re able to get perplexity sufficiently low on text samples that I write, then that means the LLM has a lot of the important algorithms running in it that are running in me. The text I write is causally downstream from the parts of me that are reflective and self-improving, that notice the little details in my cognitive processes and environment, and the parts of me that are capable of pursuing goals over a long inferential distance. An LLM agent that can mirror those properties (which we do not yet have the capability to build) seems like it would very plausibly become a very strong agent of a kind we haven’t seen before.
The phenomenon of perplexity getting lower is made up of LLMs increasingly grokking different and new parts of the generating algorithm behind the text. I think the failures of agents we’ve seen so far are explained by the fact that they do not yet grok the things that agency is made of, and a future break in that trend would be explained by “perplexity over my writing gets lower past the threshold where faithfully emulating my reflectivity and agency algorithms is necessary.”
(This perplexity argument about reflectivity etc. is roughly equivalent to one of the arguments that Eliezer gave on Dwarkesh.)
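To pin down what “perplexity over my writing” means quantitatively: it’s just the exponentiated average negative log-likelihood per token that the model assigns to the sample. A minimal sketch (the per-token log-probs here are made-up illustrative numbers, not real model outputs):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_logprobs: natural-log probabilities the model assigned to each
    token of the text sample (made-up inputs below; in practice they come
    from scoring the sample with the LLM).
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that is less surprised by each token of the same text
# (higher log-probs) gets lower perplexity on it.
weak_model = [-3.2, -4.1, -2.8, -5.0]   # hypothetical per-token log-probs
sharp_model = [-1.1, -0.9, -1.4, -0.7]
print(perplexity(weak_model))   # ~43.6
print(perplexity(sharp_model))  # ~2.8
```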
Sure, but “sufficiently low” is doing a lot of work here. In practice, a “cheaper” way to decrease perplexity is to go for the breadth (memorizing random facts), not the depth. In the limit of perfect prediction, yes, GPT-N would have to have learned agency. But the actual LLM training loops may be a ruinously compute-inefficient way to approach that limit – and indeed, they seem to be.
My current impression is that the SGD just doesn’t “want” to teach LLMs agency for some reason, and we’re going to run out of compute/data long before it’s forced to. It’s possible that I’m wrong and base GPT-5/6 paperclips us, sure. But I think if that were going to happen, it would’ve happened at GPT-4 (indeed, IIRC that was what I’d dreaded from it).
The language monkeys paper is the reason I’m extremely suspicious of treating any observed failure to elicit a capability in a model as evidence of its absence. What is it that you know that leads you to think that “SGD just doesn’t ‘want’ to teach LLMs agency”? Chatbot training elicits some things, verifiable task RL training elicits some other things (which weren’t obviously there and weren’t trivial to find, but the findings of the s1 paper suggest that they are mostly elicited, not learned, since a mere 1000 traces are sufficient to transfer the capabilities). Many more things are buried just beneath the surface, waiting for the right reward signal to cheaply bring them up and put them in control of the model’s behavior.
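To make the repeated-sampling point concrete: the language monkeys result is that pass@k keeps climbing as you draw more samples, so a capability that looks absent at k=1 can be routinely present at large k. Below is a sketch of the standard unbiased pass@k estimator used in that line of work (the one from the original Codex paper); the sample counts are made-up illustrative numbers, not figures from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples of which c were correct.

    pass@k = 1 - C(n - c, k) / C(n, k): the chance that at least one of k
    samples drawn from the n is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up illustrative numbers: a task solved in only 5 of 1000 samples
# looks hopeless at k=1 but is reliably solved with a big enough k.
n, c = 1000, 5
for k in (1, 10, 100, 1000):
    print(k, round(pass_at_k(n, c, k), 3))  # 0.005, ~0.049, ~0.41, 1.0
```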
What is it that you know that leads you to think that “SGD just doesn’t ‘want’ to teach LLMs agency”?
Mostly the fact that it hasn’t happened already, on a merely “base” model. The fact that CoTs can improve models’ problem-solving ability has been known basically since the beginning, but no similar hacks have been found for jury-rigging agenty or insightful characters. (Right? I may have missed it, but even janus’ community doesn’t seem to have anything like it.)
But yes, the possibility that none of the current training loops happened to elicit it, and the next dumb trick will Just Work, is very much salient. That’s where my other 20% are at.
I’d say long reasoning wasn’t really elicited by CoT prompting, and that you can elicit agency to about the same extent now (i.e. hopelessly unconvincingly). Long reasoning was only elicited with verifiable task RL training, and only now are there novel artifacts like s1’s 1K-trace dataset that do elicit it convincingly and that weren’t available as evidence before.
It’s possible that, as you say, agency is unusually poorly learned in the base models, but I think failure to elicit is not the way to learn whether that’s the case. Some futuristic interpretability work might show it, the same kind of work that could declare a GPT-4.5-scale model safe to release in open weights (unable to help with bioweapons or take over the world and such). We’ll probably get an open-weights Llama-4 anyway, and some time later there will be novel 1K-trace datasets that unlock things that were apparently impossible for it to do at the time of release.
I was to a significant extent responding to your “It’s possible that I’m wrong and base GPT-5/6 paperclips us”, which is not what my hypothesis predicts. If you can’t elicit a capability, it won’t be able to take control of the model’s behavior, so a base model won’t be doing anything even if you’re wrong in the way I’m framing this and the capability is there, a finetune on 1K traces away from taking control. It does still really need those 1K traces, or else it never emerges at any reasonable scale; that is, you might need a GPT-8 for it to spontaneously emerge in a base model, demonstrating that it was in GPT-5.5 all along, and making it possible to create the 1K traces that elicit it from GPT-5.5. At the same time, a clever method like R1-Zero would’ve been able to elicit it from GPT-5.5 directly, without needing a GPT-8.
I’d say long reasoning wasn’t really elicited by CoT prompting
IIRC, “let’s think step-by-step” showed up in benchmark performance basically immediately, and that’s the core of it. On the other hand, there’s nothing like “be madly obsessed with your goal” that’s known to boost LLM performance in agent settings.
There were clear “signs of life” on extended inference-time reasoning; there are (to my knowledge) none on agent-like reasoning.
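For reference, the “step-by-step” hack is just prompt surgery, roughly the two-stage zero-shot-CoT recipe: append a reasoning trigger, let the model generate, then extract an answer. The sketch below uses a hypothetical generate(prompt) stand-in for whatever model API is available; the point is how little machinery the trick needs, which is what makes the absence of an equally cheap “agency” trigger notable.

```python
def generate(prompt: str, max_tokens: int = 512) -> str:
    """Hypothetical stand-in for a call to any LLM completion API."""
    raise NotImplementedError("wire up a real model here")

def zero_shot_cot(question: str) -> str:
    # Stage 1: the reasoning trigger.
    reasoning = generate(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: extract a final answer conditioned on that reasoning.
    return generate(
        f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
        "Therefore, the answer is",
        max_tokens=32,
    )

# There is no known agentic counterpart along the lines of
#   generate(f"{task}\nBe madly obsessed with your goal.")
# that produces a comparable jump on agent benchmarks.
```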
you might need a GPT-8 for it to spontaneously emerge in a base model, demonstrating that it was in GPT-5.5 all along
If you agree that it can spontaneously emerge at a sufficiently big scale, why would you assume this scale is GPT-8, not GPT-5?
That’s basically the core of my argument. If LLMs learned agency skills, they would’ve been elicitable in some GPT-N, with no particular reason to think that N needs to be very big. On the contrary, by extrapolating a qualitative jump from GPT-3 to GPT-4 similar to the one from GPT-2 to GPT-3, I’d expected these skills to spontaneously show up in GPT-4 – if they were ever going to show up.
They didn’t show up. GPT-4 ended up as a sharper-looking GPT-3.5, and all progress since then amounted to GPT-3.5’s shape being more sharply defined, without that shape changing.
IIRC, “let’s think step-by-step” showed up in benchmark performance basically immediately, and that’s the core of it.
It’s not central to the phenomenon I’m using as an example of a nontrivially elicited capability. There, the central thing is efficient CDCL-like in-context search that enumerates possibilities while generalizing from blind alleys, so that similar blind alleys get explored less within the same reasoning trace, which can get about as long as the whole effective context (on the order of 100K tokens). Prompted (as opposed to elicited-by-tuning) CoT won’t scale to arbitrarily long reasoning traces by adding “Wait” at the end of a reasoning trace either (Figure 3 of the s1 paper). Quantitatively, this manifests as scaling of benchmark outcomes with test-time compute that’s dramatically more efficient per token (Figure 4b of the s1 paper) than parallel scaling methods such as consensus/majority and best-of-N, or even PRM-based methods (Figure 3 of this Aug 2024 paper).
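To make the sequential-vs-parallel contrast concrete, here is a sketch of the two test-time-compute recipes being compared: s1-style budget forcing (suppress the end-of-thinking marker and append “Wait” so the same trace keeps growing) versus parallel majority voting over independent samples. The generate stub, the delimiter string, and the extension count are hypothetical stand-ins, not the paper’s actual setup.

```python
from collections import Counter

END_OF_THINKING = "</think>"  # hypothetical delimiter; s1 uses its own tokens

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a single model call."""
    raise NotImplementedError

def extract_answer(trace: str) -> str:
    """Hypothetical parser for whatever answer format the model emits."""
    raise NotImplementedError

def budget_forced(question: str, extensions: int = 4) -> str:
    # Sequential scaling: one ever-longer trace. Each time the model tries
    # to stop, the stop marker is replaced with "Wait" and generation resumes,
    # so later tokens can revisit and correct earlier steps.
    trace = generate(question)
    for _ in range(extensions):
        if END_OF_THINKING not in trace:
            break  # ran out of tokens without trying to stop; nothing to force
        trace = trace.split(END_OF_THINKING)[0] + " Wait"
        trace += generate(question + trace)
    return extract_answer(trace)

def majority_vote(question: str, n: int = 64) -> str:
    # Parallel scaling: n independent traces, answer decided by plurality.
    answers = [extract_answer(generate(question)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```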
If you agree that it can spontaneously emerge at a sufficiently big scale, why would you assume this scale is GPT-8, not GPT-5?
I was just anchoring to the example I was replying to, where you sketch a stand-in capability (“paperclipping”) that doesn’t spontaneously emerge in “GPT-5/6” (i.e. with mere prompting). I took that framing as given in your example and extended it to more scale (“GPT-8”) to sketch my own point: I expect capabilities to spontaneously emerge (with mere prompting) at a much later scale than the one where they can be merely elicited (with finetuning on a tiny amount of data). It wasn’t my intent to meaningfully gesture at particular scales with respect to particular capabilities.
Agency and reflectivity are phenomena that are really broadly applicable, and I think it’s unlikely that memorizing a few facts is how perplexity on text shaped by them gets driven down. Those traits are more concentrated in places like LessWrong, but they’re almost everywhere. I think that to go from “fits the vibe of internet text and absorbs some of the reasoning” to “actually creates convincing internet text,” you need more agency and reflectivity.
My impression is that “memorize more random facts and overfit” is less efficient for reducing perplexity than “learn something that generalizes,” for these sorts of generating algorithms that are everywhere. There’s a reason we see “approximate addition” instead of “memorize every addition problem” or “learn webdev” instead of “memorize every website.”
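A back-of-the-envelope version of the “approximate addition” point: the memorization strategy grows exponentially with problem size while the generalizing algorithm stays constant-sized, so past a small scale the circuit is the cheaper way to shave loss. The numbers below are just that envelope calculation, not measurements of any model.

```python
# Rough count of distinct a+b problems over n-digit operands: ~10^(2n).
# A lookup table has to store them all; the addition algorithm is O(1) in
# size, needing only per-digit carry logic no matter how large n gets.
for n_digits in (2, 5, 10):
    table_entries = (10 ** n_digits) ** 2
    print(f"{n_digits}-digit addition: ~{table_entries:.0e} facts to memorize")
# 2-digit addition:  ~1e+04 facts to memorize
# 5-digit addition:  ~1e+10 facts to memorize
# 10-digit addition: ~1e+20 facts to memorize
```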
The RE-bench numbers for task time horizon just keep going up, and I expect them to keep doing so as models continue to gain bits and pieces of the complex machinery required for operating coherently over long time horizons.
As for when we run out of data, I encourage you to look at this piece from Epoch. We run out of RL signal for R&D tasks even later than that.