The language monkeys paper is the reason I’m extremely suspicious of any observed failure to elicit a capability in a model serving as evidence of its absence. What is it that you know, that leads you to think that “SGD just doesn’t ‘want’ to teach LLMs agency”? Chatbot training elicits some things; verifiable task RL training elicits some other things (which weren’t obviously there and weren’t trivial to find, but the findings of the s1 paper suggest that they are mostly elicited, not learned, since a mere 1,000 traces are sufficient to transfer the capabilities). Many more things are buried just beneath the surface, waiting for the right reward signal to cheaply bring them up and put them in control of the model’s behavior.
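For concreteness, the language monkeys point is about coverage under repeated sampling: a capability that looks absent at pass@1 can be nearly guaranteed at pass@k for large k. Below is a minimal sketch of the standard unbiased pass@k estimator (the sample counts are made up for illustration):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given n samples per problem of which c were correct."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: any draw of k must contain a hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up numbers: a capability the model hits on ~0.5% of samples looks absent
# at k=1, yet is reliably surfaced by drawing enough samples per problem.
n, c = 10_000, 50
for k in (1, 100, 1_000, 10_000):
    print(f"pass@{k}: {pass_at_k(n, c, k):.3f}")
```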
What is it that you know, that leads you to think that “SGD just doesn’t “want” to teach LLMs agency”?
Mostly the fact that it hasn’t happened already, on a merely “base” model. The fact that CoTs can improve models’ problem-solving ability has been known basically since the beginning, but there have been no similar hacks found for jury-rigging agenty or insightful characters. (Right? I may have missed it, but even janus’ community doesn’t seem to have anything like it.)
But yes, the possibility that none of the current training loops happened to elicit it, and that the next dumb trick will Just Work, is very much salient. That’s where my other 20% are at.
I’d say long reasoning wasn’t really elicited by CoT prompting, and that you can elicit agency to about the same extent now (i.e. hopelessly unconvincingly). It was only elicited with verifiable task RL training, and only now are there novel artifacts like s1’s 1K-trace dataset that do elicit it convincingly, which weren’t available as evidence before.
It’s possible that, as you say, agency is unusually poorly learned in base models, but I think failure to elicit is not the way to learn whether that’s the case. Some futuristic interpretability work might show this, the same kind of work that can declare a GPT-4.5-scale model safe to release in open weights (unable to help with bioweapons or take over the world and such). We’ll probably get an open-weights Llama-4 anyway, and some time later there will be novel 1K-trace datasets that unlock things that were apparently impossible for it to do at the time of release.
I was to a significant extent responding to your “It’s possible that I’m wrong and base GPT-5/6 paperclips us”, which is not what my hypothesis predicts. If you can’t elicit a capability, it won’t be able to take control of the model’s behavior, so a base model won’t be doing anything even if you are wrong in the way I’m framing this and the capability is there, a finetune on 1K traces away from taking control. It does still really need those 1K traces, or else it never emerges at any reasonable scale; that is, you might need a GPT-8 for it to spontaneously emerge in a base model, demonstrating that it was in GPT-5.5 all along and making it possible to create the 1K traces that elicit it from GPT-5.5. Meanwhile, a clever method like R1-Zero would’ve been able to elicit it from GPT-5.5 directly, without needing a GPT-8.
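As an aside, the “verifiable task RL” leaned on above (R1-Zero-style) needs nothing fancier than a programmatic check of the final answer to produce a reward; a minimal sketch, where the boxed-answer convention is just an assumption for illustration:

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Binary reward for RL on verifiable tasks: 1.0 iff the completion's final
    boxed answer matches the reference (the \\boxed{} convention is assumed)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference_answer.strip() else 0.0
```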
I’d say long reasoning wasn’t really elicited by CoT prompting
IIRC, “let’s think step-by-step” showed up in benchmark performance basically immediately, and that’s the core of it. On the other hand, there’s nothing like “be madly obsessed with your goal” that’s known to boost LLM performance in agent settings.
There were clear “signs of life” on extended inference-time reasoning; there are (to my knowledge) none on agent-like reasoning.
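To make the asymmetry concrete: the CoT “hack” is a one-line prompt suffix that can be A/B-tested against a plain prompt on any benchmark harness, and the claim is that no analogous suffix is known to work for agency. A sketch of that comparison, where `ask_model`, the problem format, and the crude grading are stand-ins rather than any particular API:

```python
from typing import Callable

AskModel = Callable[[str], str]  # stand-in for whatever model interface is in use

COT_SUFFIX = "\n\nLet's think step by step."             # the known hack
AGENCY_SUFFIX = "\n\nBe madly obsessed with your goal."  # hypothetical; no known analogous effect

def accuracy(ask_model: AskModel, problems: list[tuple[str, str]], suffix: str = "") -> float:
    """Fraction of (question, reference) pairs whose answer contains the reference."""
    hits = sum(reference in ask_model(question + suffix) for question, reference in problems)
    return hits / len(problems)

# The observed asymmetry: accuracy(model, problems, COT_SUFFIX) beats the plain
# prompt on reasoning benchmarks, while nothing like AGENCY_SUFFIX is known to
# produce a comparable jump on agentic evals.
```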
you might need a GPT-8 for it to spontaneously emerge in a base model, demonstrating that it was in GPT-5.5 all along
If you agree that it can spontaneously emerge at a sufficiently big scale, why would you assume this scale is GPT-8, not GPT-5?
That’s basically the core of my argument. If LLMs learned agency skills, they would’ve been elicitable in some GPT-N, with no particular reason to think that this N needs to be very big. On the contrary, extrapolating a qualitative jump from GPT-3 to GPT-4 similar to the one from GPT-2 to GPT-3, I’d expected these skills to show up spontaneously in GPT-4 – if they were ever going to show up.
They didn’t show up. GPT-4 ended up as a sharper-looking GPT-3.5, and all progress since then amounted to GPT-3.5’s shape being more sharply defined, without that shape changing.
IIRC, “let’s think step-by-step” showed up in benchmark performance basically immediately, and that’s the core of it.
It’s not central to the phenomenon I’m using as an example of a nontrivially elicited capability. There, the central thing is efficient CDCL-like in-context search: the model enumerates possibilities while generalizing from blind alleys so that similar blind alleys are explored less within the same reasoning trace, which can grow to about the length of the whole effective context (on the order of 100K tokens). Prompted (as opposed to elicited-by-tuning) CoT won’t scale to arbitrarily long reasoning traces by appending “Wait” at the end of a trace either (Figure 3 of the s1 paper). Quantitatively, this manifests as scaling of benchmark outcomes with test-time compute that’s dramatically more efficient per token (Figure 4b of the s1 paper) than parallel scaling methods such as consensus/majority and best-of-N, or even PRM-based methods (Figure 3 of this Aug 2024 paper).
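To make the sequential-vs-parallel contrast concrete, here is a minimal sketch of the two ways of spending test-time compute being compared, with `generate`, the end-of-thinking handling, and answer extraction as stand-ins rather than s1’s actual code: budget forcing keeps extending one trace by appending “Wait” whenever the model would stop thinking, while consensus/majority spends the same budget on independent samples.

```python
from collections import Counter
from typing import Callable

# Stand-in: generate(prompt) returns model text up to (but excluding) the
# end-of-thinking delimiter; not any particular library's API.
Generate = Callable[[str], str]

def budget_forced_trace(generate: Generate, prompt: str, num_extensions: int) -> str:
    """Sequential test-time scaling in the style of s1's budget forcing: each time
    the model would stop thinking, append "Wait" and let it keep reasoning."""
    trace = generate(prompt)
    for _ in range(num_extensions):
        trace += "\nWait"
        trace += generate(prompt + trace)  # continue decoding from the extended trace
    return trace

def majority_vote(generate: Generate, prompt: str,
                  extract_answer: Callable[[str], str], k: int) -> str:
    """Parallel test-time scaling: k independent traces, most common answer wins."""
    answers = [extract_answer(generate(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```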
If you agree that it can spontaneously emerge at a sufficiently big scale, why would you assume this scale is GPT-8, not GPT-5?
I was just anchoring to the example of yours that I was replying to, where you sketch some stand-in capability (“paperclipping”) that doesn’t spontaneously emerge in “GPT-5/6” (i.e. with mere prompting). I took that framing as given in your example and extended it to more scale (“GPT-8”) to sketch my own point: I expect capabilities to spontaneously emerge much later than the scale at which they can already be merely elicited (with finetuning on a tiny amount of data). It wasn’t my intent to meaningfully gesture at particular scales with respect to particular capabilities.