Thanks! Time will tell who is right. Point-by-point reply:
You list four things AIs seem stubbornly bad at: 1. Innovation. 2. Reliability. 3. Solving non-templated problems. 4. Compounding returns on problem-solving-time.
First of all, 2 and 4 seem closely related to me. I would say: “Agency skills” are the skills key to being an effective agent, i.e. skills useful for operating autonomously for long periods in pursuit of goals. Noticing when you are stuck is a simple example of an agency skill. Planning is another simple example. In-context learning is another example. I would say that current AIs lack agency skills, and that 2 and 4 are just special cases of this. I would also venture to guess with less confidence that 1 and 3 might be because of this as well—perhaps the reason AIs haven’t made any truly novel innovations yet is that doing so takes intellectual work, work they can’t do because they can’t operate autonomously for long periods in pursuit of goals. (Note that reasoning models like o1 are a big leap in the direction of being able to do this!) And perhaps the reason behind the relatively poor performance on non-templated tasks is… wait actually no, that one has a very easy separate explanation, which is that they’ve been trained less on those tasks. A human, too, is better at stuff they’ve done a lot.
Secondly, and more importantly, I don’t think we can say there has been ~0 progress on these dimensions in the last few years, whether you conceive of them in your way or my way. Progress is in general s-curvy; adoption curves are s-curvy. Suppose, for example, that GPT2 was 4 SDs worse than the average human at innovation, reliability, etc., GPT3 was 3 SDs worse, GPT4 was 2 SDs worse, and o1 is 1 SD worse. Under this supposition, the world would look the way that it looks today—Thane would notice zero novel innovations from AIs, Thane would have friends who try to use o1 for coding and find that it’s not useful without templates, etc. Meanwhile, as I’m sure you are aware, pretty much every benchmark anyone has ever made has shown rapid progress in the last few years—including benchmarks made by METR, which was specifically trying to measure AI R&D ability and agency, and which genuinely do seem to require (small) amounts of agency. So I think the balance of evidence is in favor of progress on the dimensions you are talking about—it just hasn’t reached human level yet, or at any rate not the level at which you’d notice big exciting changes in the world. (Analogous to: Suppose we’ve measured COVID in some countries but not others, and found that in every country we’ve measured, COVID has spread to about 0.001%–0.01% of the population and is growing exponentially. If we live in a country that hasn’t measured yet, we should assume COVID is spreading even though we don’t know anyone personally who is sick yet.)
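To make that supposition concrete, here is a toy sketch (the SD numbers are purely hypothetical, and human ability is modeled as normally distributed just for illustration):

```python
# Toy illustration of the hypothetical SD-gap trajectory above.
# Assumption: ability on a given skill is normally distributed among humans,
# so a model k SDs below the human mean still outperforms Phi(-k) of humans.
from math import erf, sqrt

def fraction_of_humans_outperformed(sd_gap: float) -> float:
    """Standard normal CDF evaluated at -sd_gap."""
    return 0.5 * (1 + erf(-sd_gap / sqrt(2)))

for model, sd_gap in [("GPT2", 4), ("GPT3", 3), ("GPT4", 2), ("o1", 1)]:
    print(f"{model}: outperforms {fraction_of_humans_outperformed(sd_gap):.2%} of humans")
```

Under this toy model, going from 4 SDs below the mean to 1 SD below takes you from outperforming roughly 0.003% of humans to roughly 16%: enormous relative progress that would still produce zero visible innovations and unimpressive coding assistants.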
...
You say:
My model is that all LLM progress so far has involved making LLMs better at the “top-down” thing. They end up with increasingly bigger databases of template problems, the closest-match templates end up ever-closer to the actual problems they’re facing, their ability to fill-in the details becomes ever-richer, etc. This improves their zero-shot skills, and test-time compute scaling allows them to “feel out” the problem’s shape over an extended period and find an ever-more-detailed top-down fit.
But it’s still fundamentally not what humans do. Humans are able to instantiate a completely new abstract model of a problem – even if it’s initially based on a stored template – and chisel at it until it matches the actual problem near-perfectly. This allows them to be much more reliable; this allows them to keep themselves on-track; this allows them to find “genuinely new” innovations.
Top down vs. bottom-up seem like two different ways of solving intellectual problems. Do you think it’s a sharp binary distinction? Or do you think it’s a spectrum? If the latter, what makes you think o1 isn’t farther along the spectrum than GPT3? If the former—if it’s a sharp binary—can you say what it is about LLM architecture and/or training methods that renders them incapable of thinking in the bottom-up way? (Like, naively it seems like o1 can do sophisticated reasoning. Moreover, it seems like it was trained in a way that would incentivize it to learn skills useful for solving math problems, and ‘bottom-up reasoning’ seems like a skill that would be useful. Why wouldn’t it learn it?)
Can you describe an intellectual or practical feat, or ideally a problem set, such that if AI solves it in 2025 you’ll update significantly towards my position?
I would also venture to guess with less confidence that 1 and 3 might be because of this as well
Agreed, I do expect that the performance on all of those is mediated by the same variable(s); that’s why I called them a “cluster”.
benchmarks made by METR who was specifically trying to measure AI R&D ability and agency abilities, and which genuinely do seem to require (small) amounts of agency
I think “agency” is a bit of an overly abstract/confusing term to use, here. In particular, I think it also allows both a “top-down” and a “bottom-up” approach.
Humans have “bottom-up” agency: they’re engaging in fluid-intelligence problem-solving and end up “drawing” a decision-making pattern of a specific shape. An LLM, on this model, has a database of templates for such decision-making patterns, and it retrieves the best-fit agency template for whatever problem it’s facing. o1/RL-on-CoTs is a way to deliberately target the set of agency-templates an LLM has, extending it. But it doesn’t change the ultimate nature of what’s happening.
In particular: the bottom-up approach would allow an agent to stay on-target for an arbitrarily long time, creating an arbitrarily precise fit for whatever problem it’s facing. An LLM’s ability to stay on-target, however, would always remain limited by the length and the expressiveness of the templates that were trained into it.
RL on CoTs is a great way to further mask the problem, which is why the o-series seems to make unusual progress on agency-measuring benchmarks. But it’s still just masking it.
can you say what it is about LLM architecture and/or training methods that renders them incapable of thinking in the bottom-up way?
Not sure. I think it might be some combination of “the pretraining phase moves the model deep into the local-minimum abyss of top-down cognition, and the cheaper post-training phase can never hope to get it out of there” and “the LLM architecture sucks, actually”. But I would rather not get into the specifics.
Can you describe an intellectual or practical feat, or ideally a problem set, such that if AI solves it in 2025 you’ll update significantly towards my position?
“Inventing a new field of science” would do it, as far as more-or-less legible measures go. Anything less than that is too easily “fakeable” using top-down reasoning.
That said, I may make this update based on less legible vibes-based evidence, such as if o3’s advice on real-life problems seems to be unusually lucid and creative. (I’m tracking the possibility that LLMs are steadily growing in general capability and that they simply haven’t yet reached the level that impresses me personally. But on balance, I mostly don’t expect this possibility to be realized.)
“Inventing a new field of science” would do it, as far as more-or-less legible measures go. Anything less than that is too easily “fakeable” using top-down reasoning.
Seems unlikely we’ll see this before stuff gets seriously crazy on anyone’s views. (Has any new field of science been invented in the last 5 years by humans? I’m not sure what you’d count.)
It seems like we should at least update towards AIs being very useful for accelerating AI R&D if we very clearly see AI R&D greatly accelerate and it is using tons of AI labor. (And this was the initial top-level prompt for this thread.) We could say something similar about other types of research.
Seems unlikely we’ll see this before stuff gets seriously crazy on anyone’s views. (Has any new field of science been invented in the last 5 years? I’m not sure what you’d count.)
Maybe some minor science fields, but yeah, entirely new science fields in 5 years is deep into ASI territory, assuming it’s something like a hard science such as physics.

Minor would count.
Thanks for the reply.

(I’m tracking the possibility that LLMs are steadily growing in general capability and that they simply haven’t yet reached the level that impresses me personally. But on balance, I mostly don’t expect this possibility to be realized.)
That possibility is what I believe. I wish we had something to bet on better than “inventing a new field of science,” because by the time we observe that, there probably won’t be much time left to do anything about it. What about e.g. “I, Daniel Kokotajlo, am able to use AI agents basically as substitutes for human engineer/programmer employees. I, as a non-coder, can chat with them and describe ML experiments I want them to run or websites I want them to build etc., and they’ll make it happen at least as quickly and well as a competent professional would.” (And not just for simple websites, but also for the kind of experiments I’d want to run, which aren’t the most complicated, but also aren’t that different from things actual AI company engineers would be doing.)
What about “The model is seemingly as good at solving math problems and puzzles as Thane is, not just on average across many problems but on pretty much any specific problem, including novel ones that are unfamiliar to both of you”?
Humans have “bottom-up” agency: they’re engaging in fluid-intelligence problem-solving and end up “drawing” a decision-making pattern of a specific shape. An LLM, on this model, has a database of templates for such decision-making patterns, and it retrieves the best-fit agency template for whatever problem it’s facing. o1/RL-on-CoTs is a way to deliberately target the set of agency-templates an LLM has, extending it. But it doesn’t change the ultimate nature of what’s happening.
In particular: the bottom-up approach would allow an agent to stay on-target for an arbitrarily long time, creating an arbitrarily precise fit for whatever problem it’s facing. An LLM’s ability to stay on-target, however, would always remain limited by the length and the expressiveness of the templates that were trained into it.
Miscellaneous thoughts: I don’t yet buy that this distinction between top-down and bottom-up is binary, and insofar as it’s a spectrum, I’d be willing to bet that there’s been progress along it in recent years. Moreover, I’m not even convinced that this distinction matters much for generalization radius / general intelligence, and it’s even less likely to matter for ‘ability to 5x AI R&D’, which is the milestone I’m trying to predict first. Finally, I don’t think humans stay on-target for an arbitrarily long time.
I wish we had something to bet on better than “inventing a new field of science,”
I’ve thought of one potential observable that is concrete, should be relatively low-capability, and should provoke a strong update towards your model for me:
If there is an AI model such that the complexity of R&D problems it can solve (1) scales basically boundlessly with the amount of serial compute provided to it (or to a “research fleet” based on it), (2) scales much faster with serial compute than with parallel compute, and (3) the required amount of human attention (“babysitting”) is constant or grows very slowly with the amount of serial compute.
This attempts to directly get at the “autonomous self-correction” and “ability to think about R&D problems strategically” ideas.
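For concreteness, here is a minimal sketch of how conditions (1) and (2) might be checked (the function names, budgets, and scoring are hypothetical placeholders, not a real harness or benchmark):

```python
# Hypothetical sketch of an evaluation for conditions (1)-(3) above.
# `hardest_problem_solved` is a stand-in for a real agent plus R&D problem suite.

def hardest_problem_solved(serial_budget: int, parallel_samples: int) -> float:
    """Hypothetical: the maximum problem difficulty the agent solves, given
    `serial_budget` tokens of sequential reasoning per attempt and
    `parallel_samples` independent best-of-n attempts."""
    raise NotImplementedError  # would wrap an actual agent and problem suite

def passes_condition_1(budgets=(10**6, 10**7, 10**8, 10**9)) -> bool:
    """(1) Solvable difficulty keeps growing with serial compute, no plateau."""
    scores = [hardest_problem_solved(b, parallel_samples=1) for b in budgets]
    return all(later > earlier for earlier, later in zip(scores, scores[1:]))

def passes_condition_2(total_budget: int = 10**9) -> bool:
    """(2) At a fixed total token budget, serial reasoning beats parallel sampling."""
    serial = hardest_problem_solved(total_budget, parallel_samples=1)
    parallel = hardest_problem_solved(total_budget // 1000, parallel_samples=1000)
    return serial > parallel

# Condition (3) would additionally require logging human interventions per run
# and checking that their number stays roughly flat as the serial budget grows.
```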
I’ve not fully thought through all the possible ways reality could Goodhart this benchmark, i.e. “technically” pass it but in a way I find unconvincing. For example, if I had failed to include condition (2), o3 would probably have already “passed” it (since it potentially achieved better performance on ARC-AGI and FrontierMath by sampling thousands of CoTs and then outputting the most frequent answer). There might be other loopholes like this...
But it currently seems reasonable and True-Name-y to me.
Nice.

What about “Daniel Kokotajlo can feed it his docs about some prosaic ML alignment agenda (e.g. the faithful CoT stuff) and then it can autonomously go off and implement the agenda and come back to him with a writeup of the results and takeaways. While working on this, it gets to check in with Daniel once a day for a brief 20-minute chat conversation.”
Does that seem to you like it’ll come earlier, or later, than the milestone you describe?
Prooobably ~simultaneously, but I can maybe see it coming earlier and in a way that isn’t wholly convincing to me. In particular, it would still be a fixed-length task; much longer than what contemporary models can reliably manage today, but still hackable using poorly-generalizing “agency templates” instead of fully general “compact generators of agenty behavior” (which I speculate humans to have and RL’d LLMs not to). It would be some evidence in favor of “AI can accelerate AI R&D”, but not necessarily “LLMs trained via SSL+RL are AGI-complete”.
Actually, I can also see it coming later. For example, suppose that capability researchers invent some method for reliably and indefinitely extending the amount of serial computation a reasoning model can productively make use of, but the compute or memory requirements grow very fast with the length of a CoT. Fairly solid empirical evidence and theoretical arguments in favor of boundless scaling could appear quickly, well before the algorithms are made efficient enough to (1) handle weeks-long CoTs and/or (2) allow wide adoption (thus making it available to you).
I think the second scenario is more plausible, actually.
OK. Next question: Suppose that next year we get a nice result showing that there is a model with serial inference-time scaling across e.g. MATH + FrontierMath + IMO problems. Recall that FrontierMath and IMO are subdivided into different difficulty levels; suppose that this model can be given e.g. 10 tokens of CoT, 100, 1000, 10,000, etc., and that somewhere around the billion-serial-token level it starts solving a decent chunk of the “medium” FrontierMath problems (but not all), while at the million-serial-token level it is only solving MATH + some easy IMO problems.

Would this count, for you?
Not for math benchmarks. Here’s one way a model could “cheat” at them: suppose the CoT involves the model generating candidate proofs/derivations, then running an internal (learned, not hard-coded) proof verifier on them, and either rejecting the candidate proof and trying to generate a new one, or outputting it. We know that this is possible, since we know that proof verifiers can be compactly specified.

This wouldn’t actually show “agency” and strategic thinking of the kinds that might generalize to open-ended domains and “true” long-horizon tasks. In particular, this would mostly fail condition (2) from my previous comment.
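To spell out that “cheating” pattern, here is a rough sketch (`propose_proof` and `verifier_accepts` are hypothetical stand-ins for behavior learned inside the CoT, not real functions):

```python
# Sketch of the guess-and-verify loop described above. Extra serial budget helps
# only by letting more candidates be checked; the same budget spent on parallel
# samples plus the same verifier would do roughly as well, hence condition (2) fails.

def propose_proof(problem: str) -> str:
    """Hypothetical: sample one candidate proof/derivation for `problem`."""
    raise NotImplementedError

def verifier_accepts(problem: str, candidate: str) -> bool:
    """Hypothetical: a learned proof verifier (compactly specifiable in principle)."""
    raise NotImplementedError

def solve_by_guess_and_check(problem: str, token_budget: int):
    tokens_used = 0
    while tokens_used < token_budget:
        candidate = propose_proof(problem)
        tokens_used += len(candidate.split())  # crude token accounting
        if verifier_accepts(problem, candidate):
            return candidate  # output the first candidate the verifier accepts
    return None  # budget exhausted without a verified proof
```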
Something more open-ended and requiring “research taste” would be needed. Maybe a comparable performance on METR’s benchmark would work for this (i.e., the model can beat a significantly larger fraction of it at 1 billion tokens compared to 1 million)? Or some other benchmark that comes closer to evaluating real-world performance.
Edit: Oh, math-benchmark performance would convince me if we get access to a CoT sample and it shows that the model doesn’t follow the above “cheating” approach, but instead approaches the problem strategically (in some sense). (Which would also require this CoT not to be hopelessly steganographied, obviously.)
Have you looked at samples of CoTs from o1, o3, DeepSeek, etc. solving hard math problems? I feel like a few examples have been shown & they seem to involve qualitative thinking, not just brute-force proof search (though of course they show lots of failed attempts and backtracking—just like a human thought-chain would).
Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI (though I’m not sure)
Have you looked at samples of CoTs from o1, o3, DeepSeek, etc. solving hard math problems?
Certainly (experimenting with r1’s CoTs right now, in fact). I agree that they’re not doing the brute-force stuff I mentioned; that was just me outlining a scenario in which a system “technically” clears the bar you’d outlined, yet I end up unmoved (I don’t want to end up goalpost-moving).
Though neither are they being “strategic” in the way I expect they’d need to be in order to productively use a billion-token CoT.
Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI
Yeah, I’m also glad to finally have something concrete-ish to watch out for. Thanks for prompting me!