MATH is a dataset of problems from high school competitions, which are well known to require a very limited set of math knowledge and to be solvable by applying simple algorithms.
I think you may underestimate the difficulty of the MATH dataset. It’s not IMO-level, obviously, but from the original paper:
We also evaluated humans on MATH, and found that a computer science PhD student who does not especially like mathematics attained approximately 40% on MATH, while a three-time IMO gold medalist attained 90%, indicating that MATH can be challenging for humans as well.
Clearly this is not a rigorous evaluation of human ability, but the dataset is far from trivial. Even if it’s not winning IMO golds yet, this level of capability is not something I would have expected (if you had asked me in 2015) to see from an AI that provably cannot multiply in one step.
{Edit: to further support that this level of performance on MATH was not obvious, this comes from the original paper:
assuming a log-linear scaling trend, models would need around 10^35 parameters to achieve 40% accuracy on MATH, which is impractical.
Further, I’d again point to the Hypermind prediction market as a very glaring case of people thinking 50% on MATH would take more time than it actually did. I have a hard time accepting that this level of performance was actually expected without the benefit of hindsight.}
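To make the paper’s extrapolation concrete, here is a minimal sketch of the kind of log-linear fit it describes. The data points below are placeholders I made up, so the extrapolated figure will not match the paper’s 10^35; the point is only to show how the projection works:

```python
import numpy as np

# Hypothetical (parameter count, accuracy) points -- placeholders,
# NOT the paper's actual measurements.
params = np.array([1e8, 1e9, 1e10, 1e11])
accuracy = np.array([0.03, 0.05, 0.07, 0.09])

# Log-linear fit: accuracy as a linear function of log10(parameters).
slope, intercept = np.polyfit(np.log10(params), accuracy, 1)

# Extrapolate: how many parameters would 40% accuracy require under this trend?
log_params_needed = (0.40 - intercept) / slope
print(f"~10^{log_params_needed:.0f} parameters for 40% accuracy")
```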
I know chain-of-thought prompting well; it’s not a way to lift a fundamental constraint, just a more efficient way of targeting the weights that represent what you want in the model.
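For concreteness, a minimal illustration of the distinction (the wording is my own, not an excerpt from any paper):

```python
# Direct prompting: the model must emit the answer immediately,
# so all of the work has to happen inside one forward pass.
direct_prompt = "Q: What is 17 * 24?\nA:"

# Chain-of-thought prompting: a worked example elicits intermediate steps,
# spreading the computation across many generated tokens.
cot_prompt = (
    "Q: What is 17 * 24?\n"
    "A: Let's think step by step. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. "
    "The answer is 408.\n\n"
    "Q: What is 23 * 31?\n"
    "A: Let's think step by step."
)
```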
The CoT research was not targeted at time complexity, but it unavoidably involves it and provides some evidence for its contribution.
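One rough way to see the connection (my own back-of-the-envelope framing, not a claim from the CoT paper): with $L$ layers, a direct answer gets at most $L$ sequential layer applications, while emitting $T$ intermediate tokens before the answer gets roughly

$$\underbrace{L}_{\text{direct answer}} \quad \text{vs.} \quad \underbrace{T \cdot L}_{T\ \text{chain-of-thought tokens}}$$

sequential steps, since each generated token triggers a fresh forward pass conditioned on the tokens before it.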
You don’t provide any proof of this
I disagree that I’ve offered no evidence: the arguments from complexity are solid, there is empirical research confirming the effect, and CoT points in a compelling direction.
I can understand if you find this part of the argument a bit less compelling. I’m deliberately avoiding details until I’m more confident that it’s safe to talk about. (To be clear, I don’t actually think I’ve got the Secret Keys to Dooming Humanity or something; I’m just trying to be sufficiently paranoid.)
I would recommend making concrete predictions on the 1-10 year timescale about performance on these datasets (and on more difficult datasets).