Maybe I should have said “is continuing without hitting a wall”.
I like that way of putting it. I definitely agree that performance hasn’t plateaued yet, which is notable, and that claim doesn’t depend much on metric.
I think if I’m honest with myself, I made that statement based on the very non-rigorous metric “how many years do I feel like we have left until AGI”, and my estimate of that has continued to decrease rapidly.
Interesting, so that way of looking at it is essentially “did it outperform or underperform expectations”. For me, after the yearly progression in 2019 and 2020, I was surprised that GPT-4 didn’t come out in 2021, so in that sense it underperformed my expectations. But it’s pretty close to what I expected in the days before release (informed by Barnett’s thread). I suppose the exception is the multi-modality, although I’m not sure what to make of it since it’s not available to me yet.
This got me curious how it impacted Metaculus. I looked at a selection of questions and tried my best to read the before/after values from the graphs.
(Edit: The original version of this table typoed the dates for “turing test”. Edit 2: The color-coding for the percentage is flipped, but I can’t be bothered to fix it.)
The lack of GPT-4 in 2020-mid-2021 wasn’t too surprising to me. They were busy productizing, optimizing, and launching, and had no genuine competition. No one with a plausibly competitive model was releasing it, and the ones which were available were not convincingly better. Why invest or release? Jurassic-1 in July 2021 was the first publicly available rival API, but I never heard anyone call it noticeably better than davinci. Tick-tock...
What I find a little more curious is the lack of a successor in 2021-2022, and that it wasn’t until August 2022 that GPT-4 finished training, with what sounds like about 6 months of training, so it hadn’t even started until around February 2022. This is a bit odd. The vibe I had been getting, particularly from the Altman ACX meetup, was that GPT-4 was considerably firmer than ‘we’ll start training GPT-4 for real in, idk, a year or whenever we get around to it, it’s nbd’. Particularly with so much going on in scaling in general.
One working hypothesis I had was that they were planning on something much more impressive than GPT-4 is (or at least, appears to be), but that doesn’t seem to be the case. GPT-4, as described, looks more or less like, ‘what if the scaling hypothesis was true and then DALL-E 1 but way bigger and mostly just text?’ Or to put it another way, what we see looks an awful lot like what you might’ve predicted in May 2020 as the immediate obvious followup, not what you might’ve predicted in February 2022 as the followup. That is, GPT-3 would’ve been finalized around like June 2019 & halted around Dec 2019 IIRC, and GPT-4 around February 2022 - but it just doesn’t look like 3 years of advancement by that many researchers I respect that much, while the known published results elsewhere are so amazing. (Yes, Whisper, CLIP, DALL-E 2, GPT-f etc, but not sure that’s enough.) So I definitely feel like I am missing something in my understanding, but I’m unsure if it’s some major advance hiding inside their pointed omission of all details, or if there was some sort of major R&D mishap where they invested a lot of effort into a failed MoE approach, or what.
Umm...the vision? How did they even train it?
Assuming they did it like Gato:
• Images are first transformed into sequences of non-overlapping 16 × 16 patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized between [−1, 1] and divided by the square-root of the patch size (i.e. √16 = 4).
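For concreteness, here is a minimal NumPy sketch of the patch preprocessing described in that Gato excerpt; the 224×224 input size in the example is just an illustrative assumption, not anything GPT-4 is known to use:

```python
import numpy as np

def image_to_patch_sequence(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an HxWxC uint8 image into non-overlapping patches in raster order,
    normalize pixels to [-1, 1], and scale by 1/sqrt(patch_size), as in Gato."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly into patches"
    x = image.astype(np.float32) / 127.5 - 1.0   # map [0, 255] -> [-1, 1]
    x /= np.sqrt(patch_size)                     # divide by sqrt(16) = 4
    # Cut into (num_patches, patch_size, patch_size, C) in raster (row-major) order.
    x = x.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size, patch_size, c)
    return x

# Example: a 224x224 RGB image becomes a sequence of 14 * 14 = 196 patch "tokens".
patches = image_to_patch_sequence(np.zeros((224, 224, 3), dtype=np.uint8))
print(patches.shape)  # (196, 16, 16, 3)
```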
There are many easy ways to incorporate vision. Vision+text models are a dime a dozen these days—as I said, this currently looks like ‘DALL-E 1 but bigger’ (VQVAE tokens → token sequence → autoregressive modeling of text/image tokens). What we have seen so far doesn’t look like 3 years of progress by the best DL researchers.
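To make that ‘DALL-E 1 but bigger’ recipe concrete, here is a rough sketch of how image tokens and text tokens end up in a single autoregressive sequence. The tokenizer and encoder functions below are dummy stand-ins (not any real OpenAI code), and the vocabulary sizes are illustrative assumptions:

```python
from typing import List

TEXT_VOCAB_SIZE = 50_000   # illustrative text vocabulary size
IMAGE_VOCAB_SIZE = 8_192   # illustrative VQ-VAE codebook size (DALL-E 1 used 8192)

def encode_text(text: str) -> List[int]:
    # Stand-in for a BPE tokenizer: one dummy id per word, in [0, TEXT_VOCAB_SIZE).
    return [hash(w) % TEXT_VOCAB_SIZE for w in text.split()]

def encode_image(image_id: int, grid: int = 32) -> List[int]:
    # Stand-in for a VQ-VAE encoder: a flattened 32x32 grid of codebook indices
    # (1024 image tokens), here just filled with a dummy value.
    return [image_id % IMAGE_VOCAB_SIZE] * (grid * grid)

def build_sequence(text: str, image_id: int) -> List[int]:
    # Image tokens get their own id range, offset past the text vocabulary, so a
    # single autoregressive transformer sees one flat sequence of integers.
    return encode_text(text) + [TEXT_VOCAB_SIZE + t for t in encode_image(image_id)]

seq = build_sequence("a photo of a cat", image_id=7)
print(len(seq))  # 5 text tokens + 1024 image tokens = 1029
```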
OpenAI has transitioned from being a purely research company to an engineering one. GPT-3 was still research, after all, and it was trained with a relatively small amount of compute. After that, they had to build infrastructure to serve the models via API, and a new supercomputing infrastructure to train new models with 100x the compute of GPT-3 in an efficient way.
The fact that we are openly hearing rumours of GPT-5 being trained, and that nobody is denying them, means it is likely that they will ship a new version every year or so from now on.
Earlier this month, PaLM-E gave a hint of one way to incorporate vision into LLMs (statement, paper), though obviously it’s a different company, so GPT-4 might have taken a different approach. Choice quote from the paper:
Inputs such as images and state estimates are embedded into the same latent embedding as language tokens and processed by the self-attention layers of a Transformer-based LLM in the same way as text
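As a minimal sketch of what that quote describes (continuous image features projected into the LLM’s token-embedding space and concatenated with the text embeddings), assuming small illustrative dimensions rather than PaLM-E’s actual ones:

```python
import numpy as np

D_MODEL = 512        # illustrative LLM embedding width
D_VISION = 256       # illustrative vision-encoder feature width
VOCAB_SIZE = 1_000   # illustrative text vocabulary size
rng = np.random.default_rng(0)

embedding_table = rng.normal(size=(VOCAB_SIZE, D_MODEL))    # LLM token embeddings
vision_projection = rng.normal(size=(D_VISION, D_MODEL))    # learned linear projection

def embed_text(token_ids: np.ndarray) -> np.ndarray:
    return embedding_table[token_ids]                        # (n_text, D_MODEL)

def embed_image(patch_features: np.ndarray) -> np.ndarray:
    # Project per-patch vision features into the same latent space as text tokens.
    return patch_features @ vision_projection                # (n_patches, D_MODEL)

text_embeds = embed_text(np.array([11, 42, 7]))               # e.g. "describe this image"
image_embeds = embed_image(rng.normal(size=(196, D_VISION)))  # e.g. 196 ViT patch features

# The combined sequence is then fed through the transformer's self-attention
# layers exactly as if every row were an ordinary text-token embedding.
inputs = np.concatenate([image_embeds, text_embeds], axis=0)
print(inputs.shape)  # (199, 512)
```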
For me, after the yearly progression in 2019 and 2020, I was surprised that GPT-4 didn’t come out in 2021
Bit of an aside, but even though coding is obviously one of the jobs that was less affected, I would say we should take into account that the unusual circumstances from 2020 onward might have impacted the speed of development of any projects ongoing at the time. It might not be fair to make a straight comparison. COVID froze or slowed down plenty of things, especially in mid-to-late 2020.
Thanks for compiling the Metaculus predictions! Seems like on 4⁄6 the community updated their timelines to be sooner. Also notable that Matthew Barnett just conceded a short timelines bet early! He says he actually updated his timelines a few months ago, partially due to ChatGPT.