I mean, to me all this indicates is that our conception of “difficult reasoning problems” is wrong and incorrectly linked to our conception of “intelligence”. Like, it shouldn’t be surprising that the LM can solve problems in text which are notoriously based around applying a short step by step algorithm, when it has many examples in the training set.
To me, this says that “just slightly improving our AI architectures to be less dumb” is incredibly hard, because the models that we would have previously expected to be able to solve trivial arithmetic problems if they could do other “harder” problems are unable to do that.
Like, it shouldn’t be surprising that the LM can solve problems in text which are notoriously based around applying a short step by step algorithm, when it has many examples in the training set.
I’m not clear on why it wouldn’t be surprising. The MATH dataset is not easy stuff for most humans. Yes, it’s clear that the algorithm used in the cases where the language model succeeds must fit in constant time and so must be (in a computational sense) simple, but it’s still outperforming a good chunk of humans. I can’t ignore how odd that is. Perhaps human reasoning is uniquely limited in tasks similar to the MATH dataset, AI consuming it isn’t that interesting, and there are no implications for other types of human reasoning, but that’s a high complexity pill to swallow. I’d need to see some evidence to favor a hypothesis like that.
To me, this says that “just slightly improving our AI architectures to be less dumb” is incredibly hard, because the models that we would have previously expected to be able to solve trivial arithmetic problems if they could do other “harder” problems are unable to do that.
It was easily predictable beforehand that a transformer wouldn’t do well at arithmetic (and all non-constant time algorithms), since transformers provably can’t express it in one shot. Every bit of capability they have above what you’d expect from ‘provably incapable of arithmetic’ is what’s worth at least a little bit of a brow-raise.
Moving to non-constant time architectures provably lifts a fundamental constraint, and is empirically shown to increase capability. (Chain of thought prompting does not entirely remove the limiter on the per-iteration expressible algorithms, but makes it more likely that each step is expressible. It’s a half-step toward a more general architecture, and it works.)
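To make the "each step is expressible" point concrete, here is a minimal sketch, offered only as an analogy and not as a claim about what any model does internally. Schoolbook multiplication is written as a loop over small bounded-work steps, with every intermediate value kept in an explicit scratchpad. The function name `multiply_with_scratchpad` and the step counter are hypothetical, introduced purely for illustration.

```python
def multiply_with_scratchpad(a: str, b: str) -> tuple[str, int]:
    """Multiply two non-negative decimal strings digit by digit.

    Each inner iteration does a bounded amount of work (one digit product,
    one addition, one carry), but the number of iterations grows with the
    input length, so the whole computation is not constant time.
    Returns the product as a string and the number of elementary steps.
    """
    # Scratchpad of output digits, least significant digit first.
    result = [0] * (len(a) + len(b))
    steps = 0
    for i, da in enumerate(reversed(a)):
        carry = 0
        for j, db in enumerate(reversed(b)):
            total = result[i + j] + int(da) * int(db) + carry
            result[i + j] = total % 10
            carry = total // 10
            steps += 1  # one bounded-work step
        # The slot just past this row has not been written yet.
        result[i + len(b)] += carry
    digits = "".join(str(d) for d in reversed(result)).lstrip("0") or "0"
    return digits, steps


if __name__ == "__main__":
    product, steps = multiply_with_scratchpad("999", "999")
    print(product, steps)
```

The step count is `len(a) * len(b)`, so it grows with the operands. A fixed-depth single pass would have to produce the same answer without that growing budget, while writing out intermediate steps spends extra iterations instead.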
It really isn’t hard. No new paradigms are required. The proofs of concept are already implemented and work. It’s more of a question of when one of the big companies decides it’s worth poking with scale.
I don’t think it’s odd at all: even a terrible chess bot can outplay almost all humans, because most humans haven’t studied chess. MATH is a dataset of problems from high school competitions, which are well known to require a very limited set of math knowledge and to be solvable by applying simple algorithms.
I know chain of thought prompting well—it’s not a way to lift a fundamental constraint, it just is a more efficient targeting of the weights which represent what you want in the model.
It really isn’t hard. No new paradigms are required. The proofs of concept are already implemented and work. It’s more of a question of when one of the big companies decides it’s worth poking with scale.
You don’t provide any proof of this, just speculation, much of it based on massive oversimplifications (if I have time I’ll write up a full rebuttal). For example, RWKV is more of a nice idea that is better for some benchmarks, worse for others, than some kind of new architecture that unlocks greater overall capabilities.
MATH is a dataset of problems from high school competitions, which are well known to require a very limited set of math knowledge and to be solvable by applying simple algorithms.
I think you may underestimate the difficulty of the MATH dataset. It’s not IMO-level, obviously, but from the original paper:
We also evaluated humans on MATH, and found that a computer science PhD student who does not especially like mathematics attained approximately 40% on MATH, while a three-time IMO gold medalist attained 90%, indicating that MATH can be challenging for humans as well.
Clearly this is not a rigorous evaluation of human ability, but the dataset is far from trivial. Even if it’s not winning IMO golds yet, this level of capability is not something I would have expected to see managed by an AI that provably cannot multiply in one step (if you had asked me in 2015).
{Edit: to further support that this level of performance on MATH was not obvious, this comes from the original paper:
assuming a log-linear scaling trend, models would need around 10^35 parameters to achieve 40% accuracy on MATH, which is impractical.
Further, I’d again point to the Hypermind prediction market for a very glaring case of people thinking 50% on MATH was going to take more time than it actually did. I have a hard time accepting that this level of performance was actually expected without the benefit of hindsight.}
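For reference, this is roughly how a log-linear extrapolation like the one quoted from the paper works. It is a sketch under an assumed functional form, not the paper's actual fit. The symbols $A$, $N$, $\alpha$, $\beta$ and the two reference points $(N_1, A_1)$, $(N_2, A_2)$ are my notation, and no fitted values are given here.

```latex
% Assumed form: accuracy A is linear in the log of parameter count N.
A(N) = \alpha + \beta \log_{10} N

% Slope estimated from two observed (hypothetical) points:
\beta = \frac{A_2 - A_1}{\log_{10} N_2 - \log_{10} N_1}

% Parameter count needed to reach a target accuracy A^*:
N^* = N_1 \cdot 10^{(A^* - A_1)/\beta}
```

When the slope $\beta$ is small, the exponent $(A^* - A_1)/\beta$ becomes large, which is how a modest accuracy target can extrapolate to an impractically large parameter count.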
I know chain of thought prompting well—it’s not a way to lift a fundamental constraint, it just is a more efficient targeting of the weights which represent what you want in the model.
Chain of thought prompting was not targeted at time complexity, but it unavoidably involves it, and its success provides some evidence for time complexity’s contribution.
You don’t provide any proof of this
I disagree that I’ve offered no evidence: the arguments from complexity are solid, there is empirical research confirming the effect, and CoT points in a compelling direction.
I can understand if you find this part of the argument a bit less compelling. I’m deliberately avoiding details until I’m more confident that it’s safe to talk about them. (To be clear, I don’t actually think I’ve got the Secret Keys to Dooming Humanity or something; I’m just trying to be sufficiently paranoid.)
I would recommend making concrete predictions on the 1-10 year timescale about performance on these datasets (and on more difficult datasets).