I’m going to go against the flow here and not be easily impressed. I suppose it might just be copium.
Any actual reason to expect that the new model beating these challenging benchmarks, which have previously remained unconquered, is any more of a big deal than the last several times a new model beat a bunch of challenging benchmarks that have previously remained unconquered?
Don’t get me wrong, I’m sure it’s amazingly more capable in the domains in which it’s amazingly more capable. But I see quite a lot of “AGI achieved” panicking/exhilaration in various discussions, and I wonder whether it’s more justified this time than the last several times this pattern played out. Does anything indicate that this capability advancement is going to generalize in a meaningful way to real-world tasks and real-world autonomy, rather than remaining limited to the domain of extremely well-posed problems?
One of the reasons I’m skeptical is the part where it requires thousands of dollars’ worth of inference-time compute. That implies it’s doing brute force at extreme scale, which is a strategy that would only work in, again, domains of well-posed problems with easily verifiable solutions. It’s similar to how o1 blows Sonnet 3.5.1 out of the water on math but isn’t much better outside it.
Edit: If we actually look at the benchmarks here:
The most impressive-looking jump is FrontierMath going from 2% to 25.2%, but that’s also exactly the benchmark where the strategy of “generate 10k candidate solutions, hook them up to a theorem-verifier, see if one of them checks out, output it” would shine (see the sketch at the end of this comment).
(With the potential theorem-verifier having been internalized by o3 over the course of its training; I’m not saying there was a separate theorem-verifier manually wrapped around o3.)
Significant progress on ARC-AGI has previously been achieved using “crude program enumeration”, which made the authors conclude that “about half of the benchmark was not a strong signal towards AGI”.
The SWE-bench jump from 48.9% to 71.7% is significant, but it’s not much of a qualitative improvement.
Not to say it’s a nothingburger, of course. But I’m not feeling the AGI here.
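For concreteness, here’s a minimal sketch of the kind of generate-and-verify loop I’m describing, on a toy problem. Everything in it (the random-guess generator, the polynomial-root checker, the 10k budget) is an illustrative stand-in rather than a claim about how o3 actually works; the point is just that brute sampling pays off exactly when a cheap, reliable verifier exists.

```python
import random

def generate_candidate(rng):
    # Stand-in for sampling one candidate solution from a model;
    # here it is just a random integer guess.
    return rng.randint(-1000, 1000)

def verify(candidate):
    # Stand-in for a cheap, reliable checker -- the ingredient that makes
    # brute search viable. Toy problem: find an integer root of
    # x^3 - 6x^2 + 11x - 6 (its roots are 1, 2, and 3).
    x = candidate
    return x**3 - 6 * x**2 + 11 * x - 6 == 0

def solve_by_brute_search(n_candidates=10_000, seed=0):
    """Sample many candidates and return the first one that verifies.

    This only works in well-posed domains where answers are easy to check;
    without a cheap verifier, the strategy has nothing to lean on.
    """
    rng = random.Random(seed)
    for _ in range(n_candidates):
        candidate = generate_candidate(rng)
        if verify(candidate):
            return candidate
    return None  # search budget exhausted

print(solve_by_brute_search())  # typically prints 1, 2, or 3
```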
It’s not AGI, but for human labor to retain any long-term value, there has to be an impenetrable wall that AI research hits, and this result rules out a small but nonzero number of places where that wall might have been.
To first order, I believe a lot of the reason the shrill “AGI achieved” posting tends to be overhyped is not that the models are theoretically incapable, but that reliability turned out to matter far more for replacing jobs quickly than people realized. There are only a very few jobs an AI agent can do well without instantly breaking down because it can’t error-correct or stay reliable, and I think AI bulls have continually underestimated this.
Indeed, one of my broader updates is that a capability only matters to the wider economy if it’s very, very reliable, and I agree with Leo Gao and Alexander Gietelink Oldenziel that reliability is much more of a bottleneck than people thought:
https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#f5WAxD3WfjQgefeZz
https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#YxLCWZ9ZfhPdjojnv
I agree that this seems like an important factor. See also this post making a similar point.
To be clear, I do expect AI to accelerate AI research, and AI research may be one of the few exceptions to this rule. Still, this is one of the reasons I have longer timelines nowadays than a lot of other people, why I expect AI’s impact on the economy to be surprisingly discontinuous in practice, and a big reason I expect AI governance to see few laws passed until very near the end of the “AI as complement” era for most jobs other than AI research.
The post you linked is pretty great, thanks for sharing.
These math and coding benchmarks are so narrow that I’m not sure how anybody could treat them as saying anything about “AGI”. LLMs haven’t even tried to be actually general.
How close is “the model” to passing the Woz test (go into a strange house, locate the kitchen, and make a cup of coffee, implicitly without damaging or disrupting things)? If you don’t think the kinesthetic parts of robotics count as part of “intelligence” (and why not?), then could it interactively direct a dumb but dextrous robot to do that?
Can it design a nontrivial, useful physical mechanism that does a novel task effectively and can be built efficiently? Produce usable, physically accurate drawings of it? Actually make it, or at least provide a good enough design that it can have it made? Diagnose problems with it? Improve the design based on observing how the actual device works?
Can it look at somebody else’s mechanical design and form a reasonably reliable opinion about whether it’ll work?
Even in the coding domain, can it build and deploy an entire software stack offering a meaningful service on a real server without assistance?
Can it start an actual business and run it profitably over the long term? Or at least take a good shot at it? Or do anything else that involves integrating multiple domains of competence to flexibly pursue possibly-somewhat-fuzzily-defined goals over a long time in an imperfectly known and changing environment?
Can it learn from experience and mistakes in actual use, without the hobbling training-versus-inference distinction? How quickly and flexibly can it do that?
When it schemes, are its schemes realistically feasible? Can it tell when it’s being conned, and how? Can it recognize an obvious setup like “copy this file to another directory to escape containment”?
Can it successfully persuade people to do specific, relatively complicated things (as opposed to making transparently unworkable hypothetical plans to persuade them)?
It’s not really dangerous, real AGI yet. But it will be soon: this is a version that’s like a human with severe damage to the frontal lobes, which provide agency and self-management, and to the temporal lobes, which handle episodic memory and therefore continuous, self-directed learning.
Those things are relatively easy to add, since it’s smart enough to self-manage as an agent and self-direct its learning. Episodic memory systems exist and only need modest improvements; some of the low-hanging fruit is glaringly obvious from a computational neuroscience perspective, so I expect it to be picked almost as soon as a competent team starts working on episodic memory.
Don’t indulge in even possible copium. We need your help to align these things, fast. The possibility of dangerous AGI soon can no longer be ignored.
Gambling that the gaps in LLMs’ abilities (relative to humans) won’t be filled soon is a bad gamble.
A very large amount of human problem-solving and innovation in challenging areas is creating and evaluating potential solutions; it is a stochastic rather than a deterministic process. My understanding is that our brains evaluate ideas in a highly parallel way, across thousands of ‘cortical columns’ a few mm across (Jeff Hawkins’s Thousand Brains formulation), with an attention mechanism that promotes the filtered best outputs of those myriad processes, forming our ‘consciousness’.
So generating and discarding large numbers of solutions within simpler ‘sub-brains’, via iterative or parallelized operation, is very much how I would expect to see AGI and superintelligence develop.
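As a toy illustration of that propose-and-filter picture (the proposers, scorer, thread pool, and target number below are all made up for the sketch, not drawn from any real architecture): many cheap, noisy generators run in parallel, and a selection step promotes whichever candidate scores best.

```python
import random
from concurrent.futures import ThreadPoolExecutor

TARGET = 42  # stand-in for whatever the system is actually trying to hit

def propose(seed):
    # One "sub-brain": a cheap, noisy generator of candidate solutions.
    return random.Random(seed).randint(0, 100)

def score(candidate):
    # The filtering / "attention" step: rank candidates by goodness of fit.
    return -abs(candidate - TARGET)

def parallel_search(n_proposers=1000):
    """Run many noisy proposers in parallel and promote the best-scoring output."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        candidates = list(pool.map(propose, range(n_proposers)))
    return max(candidates, key=score)

print(parallel_search())  # almost certainly prints 42 with 1000 proposers
```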