Capabilities progress is continuing without slowing.
I disagree. For reasonable ways to interpret this statement, capabilities progress has slowed. Consider the timeline:
2018: GPT-1 paper
2019: GPT-2 release
2020: GPT-3 release
2023: GPT-4 release
Notice the 1 year gap from GPT-2 to GPT-3 and the 3 year gap from GPT-3 to GPT-4. If capabilities progress had not slowed, the latter capabilities improvement should be ~3x the former.
How do those capability steps actually compare? It’s hard to say with the available information. In December 2022, Matthew Barnett estimated that the 3->4 improvement would be about as large as the 2->3 improvement. Unfortunately, there’s not enough information to say whether that prediction was correct. However, my subjective impression is that they are of comparable size, or even that the 3->4 step is smaller.
If we do accept that the 3->4 step is about as big as the 2->3 step, that means that progress went ~33% as fast from 3 to 4 as it did from 2 to 3.
How would you measure this more objectively?
What bugs me is that in terms of utility, the step from 50 percent accuracy to 80 percent is smaller than the step from 80 percent to 90 percent.
The former gives you a system that still fails 20 percent of the time; the latter halves your error rate.
The step from 90 to 95 percent accuracy is an even larger utility gain: half the babysitting, and the system becomes good enough for lower-stakes jobs.
And so on with each halving, where going from 99 to 99.5 percent is a larger step than all the prior ones.
It’s tricky because different ways to interpret the statement can give different answers. Even if we restrict ourselves to metrics that are monotone transformations of each other, such transformations don’t generally preserve derivatives.
Your example is good. As an additional example, if someone were particularly interested in the Uniform Bar Exam (where GPT-3.5 scores 10th percentile and GPT-4 scores 90th percentile), they would justifiably perceive an acceleration in capabilities.
So ultimately the measurement is always going to involve at least a subjective choice of metric.
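For concreteness, here is a toy illustration (my own example, not from anyone above) of how a monotone change of metric can turn steady progress into accelerating progress:

```latex
% Let capability metric f improve at a constant rate, and let g be a
% monotone transformation of f:
\[
  f(t) = t, \qquad g(t) = e^{f(t)} = e^{t}.
\]
% Then f'(t) = 1 is constant ("steady progress"), while g'(t) = e^{t} grows
% without bound ("accelerating progress"), even though f and g rank every
% pair of models identically.
```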
Right. Or what really matters, criticality gain.
Suppose the current generation, GPT-4, is not quite good enough at designing improved AIs to be worth spending finite money supplying it with computational resources. (So in this example, GPT-4 is, hypothetically, dumb enough that it would need $5 billion in compute to find a GPT-5, while OpenAI could pay humans, buy a smaller amount of hardware, and find it for $2 billion.)
But GPT-5 needs just $2 billion to find GPT-6, while OpenAI would need $3 billion to do it with humans. (Because 6 is harder than 5, and so on.)
GPT-6 has enough working memory and talent that it finds GPT-7 with $1 billion...
And so on, until GPT-n is so effective at using all the compute it is supplied that it would be a waste of effort to have it spend compute on developing n+1 when it could instead do tasks to pay for more compute, or pay for robots to collect new scientific data it can then train on.
I call the process “find” because it’s searching a vast possibility space of choices made at each layer of the system.
The same thing goes for self-replicating robots. If robots are too dumb, they won’t make enough new robot parts (or generate enough economic value, since at least at first these things will operate in the human economy) to pay for another copy of one robot, on average, before the robot wears out or screws up badly enough to wreck itself.
In each case above, a small increase in intelligence could take the process from “damps to zero” to “gains exponentially”.
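To make that threshold a bit more concrete, here is a minimal toy simulation of the scenario above (the decay/growth factors and the function name are my own made-up assumptions, chosen only to roughly reproduce the dollar figures in this comment):

```python
def self_improvement_loop(ai_cost=5.0, human_cost=2.0, ai_factor=0.5,
                          human_factor=1.5, generations=5):
    """Toy model: each generation, the successor is found by whichever route
    (current AI vs. human researchers) is cheaper. Costs are in $ billions.
    ai_cost falls each generation (a smarter designer), human_cost rises
    (a harder target); the crossover is the 'criticality' threshold."""
    for gen in range(5, 5 + generations):
        route = "the current model" if ai_cost < human_cost else "human researchers"
        cost = min(ai_cost, human_cost)
        print(f"GPT-{gen} found by {route} for ${cost:.2f}B")
        ai_cost *= ai_factor        # assumed: each generation designs the next more cheaply
        human_cost *= human_factor  # assumed: each generation is harder for humans alone

self_improvement_loop()
```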
The former results in the error rate being 20/50 = 40% of the previous one, while the latter results in it being 10/20 = 50%, so the former would appear to be a bigger step?
You’re right. I was latched onto the fact that with the former case you still have to babysit a lot, because failing 1 in 5 times is a lot of errors, while 1 in 10 is starting to approach viability for some tasks.
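For concreteness, a quick check of the arithmetic in this exchange, showing both framings (how much of the previous error rate remains, and what the absolute error rate still is):

```python
# Accuracy steps discussed above, as (before, after) pairs in percent.
steps = [(50, 80), (80, 90), (90, 95), (99, 99.5)]

for before, after in steps:
    err_before = 100 - before            # absolute error rate before the step
    err_after = 100 - after              # absolute error rate after the step
    remaining = err_after / err_before   # fraction of the old error rate that remains
    print(f"{before}% -> {after}%: error {err_before}% -> {err_after}%, "
          f"{remaining:.0%} of the previous error rate remains")
```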
Perhaps I simply object to such a [change in metric]/[change in time] calculation, in which case I’m still at fault for phrasing my claim in the terminology of speed. Maybe I should have said “is continuing without hitting a wall”.
My main objection, as you describe in other comments, is that the choice of metric matters a great deal. In particular, even if log(KL divergence) continues improving (sub)linearly, the metrics we actually care about, like “is it smarter than a human” or “how much economic activity can be done by this AI”, may be nonlinear functions of log(KL divergence) and may not be slowing down.
I think if I’m honest with myself, I made that statement based on the very non-rigorous metric “how many years do I feel like we have left until AGI”, and my estimate of that has continued to decrease rapidly.
I like that way of putting it. I definitely agree that performance hasn’t plateaued yet, which is notable, and that claim doesn’t depend much on metric.
Interesting, so that way of looking at it is essentially “did it outperform or underperform expectations”. For me, after the yearly progression in 2019 and 2020, I was surprised that GPT-4 didn’t come out in 2021, so in that sense it underperformed my expectations. But it’s pretty close to what I expected in the days before release (informed by Barnett’s thread). I suppose the exception is the multi-modality, although I’m not sure what to make of it since it’s not available to me yet.
This got me curious how it impacted Metaculus. I looked at some selected questions and tried my best to read the before/after values from each graph.
(Edit: The original version of this table typoed the dates for “turing test”. Edit 2: The color-coding for the percentage is flipped, but I can’t be bothered to fix it.)
The lack of GPT-4 in 2020-mid-2021 wasn’t too surprising to me. They were busy productizing, optimizing, launching, and had no genuine competition. Everyone with a plausibly competitive model was not releasing it, and the ones which were available were not convincingly better. Why invest or release? Jurassic-1 in July 2021 was the first public API, but I never heard anyone call it noticeably better than davinci. Tick-tock...
What I find a little more curious is the lack of a successor in 2021-2022, and that it wasn’t until August 2022 that GPT-4 finished training, with what sounds like about 6 months of training, so it hadn’t even started until like February 2022. This is a bit odd. The vibe I had been getting, particularly from the Altman ACX meetup, was that GPT-4 was considerably firmer than ‘we’ll start training GPT-4 for real in, idk, a year or whenever we get around to it, it’s nbd’. Particularly with so much going on in scaling in general.
One working hypothesis I had was that they were planning on something much more impressive than GPT-4 is (or at least, appears to be), but that doesn’t seem to be the case. GPT-4, as described, looks more or less like, ‘what if the scaling hypothesis was true and then DALL-E 1 but way bigger and mostly just text?’ Or to put it another way, what we see looks an awful lot like what you might’ve predicted in May 2020 as the immediate obvious followup, not what you might’ve predicted in February 2022 as the followup. That is, GPT-3 would’ve been finalized around like June 2019 & halted around Dec 2019 IIRC, and GPT-4 around February 2022 - but it just doesn’t look like 3 years of advancement by that many researchers I respect that much, especially when the known published results elsewhere are so amazing. (Yes, Whisper, CLIP, DALL-E 2, GPT-f etc, but not sure that’s enough.) So I definitely feel like I am missing something in my understanding, but I’m unsure if it’s some major advance hiding inside their pointed omission of all details or if there was some sort of major R&D mishap where they invested a lot of effort into a failed MoE approach, or what.
Umm...the vision? How did they even train it?
Assuming they did it like Gato:
• Images are first transformed into sequences of non-overlapping 16 × 16 patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized between [−1, 1] and divided by the square-root of the patch size (i.e. √ 16 = 4).
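For readers who want the mechanics spelled out, here is a minimal NumPy sketch of the quoted patching step (my reading of the Gato description, not anyone’s actual pipeline; it assumes an RGB image whose sides are multiples of 16):

```python
import numpy as np

def to_gato_style_patches(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) uint8 image into non-overlapping patch x patch tiles in
    raster order, then map pixels to [-1, 1] and divide by sqrt(patch size)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image sides must be multiples of the patch size"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (num_patches, p, p, C), raster order
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c).astype(np.float32)
    x = x / 127.5 - 1.0            # map [0, 255] -> [-1, 1]
    return x / np.sqrt(patch)      # sqrt(16) = 4, per the quoted description

patches = to_gato_style_patches(np.zeros((224, 224, 3), dtype=np.uint8))
print(patches.shape)  # (196, 16, 16, 3)
```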
There are many easy ways to incorporate vision. Vision+text models are a dime a dozen these days—as I said, this currently looks like ‘DALL-E 1 but bigger’ (VQVAE tokens → token sequence → autoregressive modeling of text/image tokens). What we have seen so far doesn’t look like 3 years of progress by the best DL researchers.
OpenAI has transitioned from being a purely research company to an engineering one. GPT-3 was still research, after all, and it was trained with a relatively small amount of compute. After that, they had to build infrastructure to serve the models via API, and a new supercomputing infrastructure to train new models with 100x the compute of GPT-3 in an efficient way.
The fact that we are openly hearing rumours of GPT-5 being trained, and that nobody is denying them, means it is likely that they will ship a new version every year or so from now on.
Earlier this month, PaLM-E gave a hint of one way to incorporate vision into LLMs (statement, paper), though obviously it’s a different company, so GPT-4 might have taken a different approach. Choice quote from the paper:
Inputs such as images and state estimates are embedded into the same latent embedding as language tokens and processed by the self-attention layers of a Transformer-based LLM in the same way as text
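A very rough sketch of the idea in that quote, in PyTorch (the module, dimensions, and projection below are illustrative assumptions, not PaLM-E’s or GPT-4’s actual architecture; causal masking and the real vision encoder are omitted):

```python
import torch
import torch.nn as nn

class MultimodalPrefixLM(nn.Module):
    """Toy illustration: continuous inputs (e.g. image features) are projected
    into the same embedding space as text tokens, then the combined sequence
    is processed by an ordinary Transformer self-attention stack."""
    def __init__(self, vocab_size=32000, d_model=512, image_feat_dim=1024):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(image_feat_dim, d_model)  # image features -> token embedding space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the LLM's attention layers
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_feats, token_ids):
        # image_feats: (batch, n_image_tokens, image_feat_dim) from some vision encoder
        # token_ids:   (batch, n_text_tokens)
        img = self.image_proj(image_feats)   # (batch, n_image_tokens, d_model)
        txt = self.token_embed(token_ids)    # (batch, n_text_tokens, d_model)
        seq = torch.cat([img, txt], dim=1)   # one sequence, attended over jointly, "in the same way as text"
        return self.lm_head(self.blocks(seq))

model = MultimodalPrefixLM()
logits = model(torch.randn(1, 4, 1024), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 12, 32000])
```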
For me, after the yearly progression in 2019 and 2020, I was surprised that GPT-4 didn’t come out in 2021
Bit of an aside, but even though coding is obviously one of the jobs that was less affected, I would say that we should take into account that the unusual circumstances from 2020 onward might have impacted the speed of development of any ongoing projects at the time. It might not be fair to make a straight comparison. COVID froze or slowed down plenty of things, especially in mid to late 2020.
Thanks for compiling the Metaculus predictions! Seems like on 4⁄6 the community updated their timelines to be sooner. Also notable that Matthew Barnett just conceded a short timelines bet early! He says he actually updated his timelines a few months ago, partially due to ChatGPT.