your view seems to imply that we will move quickly from much worse than humans to much better than humans, but it’s likely that we will move slowly through the human range on many tasks
We might be able to falsify that in a few months.
There is a joint Google / OpenAI project called BIG-bench. They’ve crowdsourced ~200 highly diverse text tasks (from answering scientific questions to predicting protein interaction sites to measuring self-awareness).
One of the goals of the project is to see how performance on the tasks changes with model size, across many orders of magnitude.
Half a year ago, they presented some preliminary results. A quick summary:
if you increase the number of parameters N from 10^7 to 10^10, the aggregate performance score grows roughly like log(N).
But after the 10^10 point, something interesting happens: the score starts growing much faster (~N).
And for some tasks, the plot looks like a hockey stick (a sudden change from ~0 to almost-human).
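For intuition, here is a toy curve with that shape (a minimal sketch in Python; the 10^10 breakpoint location and both slopes are made-up constants for illustration, not the actual BIG-bench numbers):

```python
import numpy as np

def toy_aggregate_score(n_params):
    """Illustrative only: log-like growth up to 1e10 parameters,
    then roughly linear-in-N growth. All constants are made up."""
    breakpoint_n = 1e10
    log_slope = 0.05      # toy score gained per 10x increase below the breakpoint
    linear_slope = 0.1    # toy score gained per extra 1e10 parameters above it
    if n_params <= breakpoint_n:
        return log_slope * np.log10(n_params / 1e7)
    score_at_break = log_slope * np.log10(breakpoint_n / 1e7)
    return score_at_break + linear_slope * (n_params / breakpoint_n - 1)

for n in [1e7, 1e8, 1e9, 1e10, 2e10, 4e10]:
    print(f"N = {n:.0e}: toy score ~ {toy_aggregate_score(n):.2f}")
```

In this toy curve, the entire 10^7 to 10^10 range adds about as much score as going from 10^10 to roughly 2.5 × 10^10.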
The paper with the full results is expected to be published in the next few months.
Judging by the preliminary results, the FOOM could start like this:
“The GPT-5 still sucks on most tasks. It’s mostly useless. But what if we increase parameters_num by 2? What could possibly go wrong?”
Hot damn, where can I see these preliminary results?
The results were presented at a workshop by the project organizers. The video from the workshop is available here (the most relevant presentation starts at 5:05:00).
It’s one of those innocent presentations that, after you understand the implications, keep you awake at night.
Presumably you’re referring to this graph. The y-axis looks like the kind of score that ranges between 0 and 1, in which case this looks sort-of like a sigmoid to me, which accelerates as it gets closer to ~50% performance (and decelerates as it gets closer to 100% performance).
If so, we might want to ask whether these tasks were chosen ~randomly (among tasks that are indicative of how useful AI is) or selected for difficulty in some way. In particular, assume that most tasks look sort-of like a sigmoid as models are scaled up (accelerating around 50%, improving more slowly when they’re closer to 0% or 100%). Then you might think that the most exciting tasks to submit to BIG-bench would be the ones that small models can’t handle but that large models rapidly improve on (as opposed to tasks that are basically solved by 10^10 parameters). In which case the aggregation of all these tasks could be expected to look sort-of like this, improving faster after 10^10 than before.
...is one story I can tell, but idk if I would have predicted that beforehand, and fast acceleration after 10^10 is certainly consistent with many people’s qualitative impressions of GPT-3. So maybe there is some real acceleration going on.
(Also, see this post for similar curves, but for the benchmarks that OpenAI tested GPT-3 on. There’s no real acceleration visible there, other than for arithmetic.)
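One way to sanity-check the selection-effect story above is to simulate it. The sketch below (Python; the per-task sigmoid model and the assumption that submitters favour tasks whose midpoints sit past 10^10 parameters are both made up for illustration, not taken from BIG-bench data) aggregates 200 smooth sigmoid tasks and reproduces this kind of post-10^10 acceleration:

```python
import numpy as np

rng = np.random.default_rng(0)
log_n = np.linspace(7, 12, 100)            # log10 of parameter count

def sigmoid_task(midpoint, width=0.5):
    """Per-task score modelled as a sigmoid in log10(N)."""
    return 1 / (1 + np.exp(-(log_n - midpoint) / width))

# Assumed selection effect: submitters favour tasks that small models can't do,
# so task midpoints mostly sit past 10^10 parameters.
midpoints = rng.uniform(10.0, 12.0, size=200)
aggregate = np.mean([sigmoid_task(m) for m in midpoints], axis=0)

for lg in [8, 9, 10, 11, 12]:
    i = np.argmin(np.abs(log_n - lg))
    print(f"10^{lg} params: mean toy score ~ {aggregate[i]:.2f}")
# The aggregate stays low below 10^10 and climbs much faster after it,
# even though every individual task is a smooth sigmoid with no sudden jump.
```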
The preliminary results were obtained on a subset of the full benchmark (~90 tasks vs 206 tasks). There have been many changes since then, including scoring changes. Thus, I’m not sure we’ll see the same dynamics in the final results. Most likely yes, but maybe not.
I agree that the task selection process could create dynamics that look like acceleration. A good point.
As I understand it, the organizers accepted almost all submitted tasks (the main rejection reasons were technical: copyright issues, etc.). So it was mostly self-selection, with a bias towards the hardest imaginable text tasks. It seems that for many contributors, the main motivation was something like:
“Take that, the most advanced AI of Google! Let’s see if you can handle my epic task!”
This includes many cognitive tasks that are supposedly human-complete (e.g. understanding of humor, irony, and ethics), and tasks that probe the model’s generality (e.g. playing chess, recognizing images, navigating mazes, all in text).
I wonder if the performance dynamics on such tasks will follow the same curve.
The list of all tasks is available here.
Seems interestingly similar to the grokking phenomenon.
So these results are not reported in “Multitask Prompted Training Enables Zero-Shot Task Generalization”, Sanh et al 2021?
For Sanh et al. (2021), we were able to negotiate access to preliminary numbers from the BIG Bench project and run the T0 models on it. However, the authors of Sanh et al. and the authors of BIG Bench are different groups of people.
The aforementioned BIG-bench paper from Google is now publicly available:
some highlights
the paper
Nope. Although the linked paper uses the same benchmark (a tiny subset of it), the paper comes from a separate research project.
As I understand, the primary topic of the future paper will be the BIG-bench project itself, and how the models from Google / OpenAI perform on it.
Hypothesis:
doing things in the real world requires diverse skills (strong performance on a diverse set of tasks)
hockey-sticking performance on a particular task makes that task no longer the constraint on what you can accomplish
but now some other task is the bottleneck
so, unless you can hockey-stick on all the tasks at once, the growth of your overall ability to do things in the world will get smoothed out a bunch, even if it’s still very rapid
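A toy numerical version of this argument (a sketch with assumed numbers; treating “doing a thing in the world” as requiring several component skills at once, with the weakest required skill as the bottleneck, is the modelling choice being illustrated):

```python
import numpy as np

rng = np.random.default_rng(1)
log_n = np.linspace(9, 13, 200)                          # log10 of model scale

# 20 component skills, each hockey-sticking at its own (assumed) scale threshold.
thresholds = rng.uniform(9.5, 12.5, size=20)
skills = 1 / (1 + np.exp(-8.0 * (log_n[None, :] - thresholds[:, None])))

# 100 hypothetical real-world activities, each needing a handful of skills;
# the weakest required skill is the bottleneck for that activity.
needed = [rng.choice(20, size=rng.integers(2, 7), replace=False) for _ in range(100)]
ability_per_activity = np.array([skills[idx].min(axis=0) for idx in needed])
overall = ability_per_activity.mean(axis=0)              # average across activities

for lg in [10, 11, 12, 13]:
    i = np.argmin(np.abs(log_n - lg))
    print(f"10^{lg} params: overall ability ~ {overall[i]:.2f}")
# Each individual skill jumps abruptly, but the average over activities climbs
# more gradually, because at most scales some required skill is still missing.
```

The smoothing here comes entirely from different activities being gated by different skills; each individual skill still takes off suddenly.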
Seems like there’s a spectrum between smooth accelerating progress and discontinuous takeoff. And where we end up on that spectrum depends on a few things:
how much simple improvements (better architecture, more compute) help with a wide variety of tasks
how much improvement in AI systems is bottlenecked on those tasks
how many resources the world is pouring into finding and making those improvements
Recent evidence (success of transformers, scaling laws) seems to suggest that Eliezer was right in the FOOM debate that simple input changes could make a large difference across a wide variety of tasks.
It’s less clear to me though whether that means a local system is going to outcompete the rest of the economy, because it seems plausible to me that the rest of the economy is also going to be full-steam ahead searching the same improvement space that a local system will be searching.
And I think in general, real-world complexity tends to smooth out lumpy graphs. As an example, even once we realized that GPT-2 was powerful and GPT-3 would be even better, there was still a whole bunch of engineering work that had to go into figuring out how to run such a big neural network across multiple machines.
That kind of real-world messiness seems like it will introduce new bottlenecks at every step along the way, and at every order-of-magnitude change in scale, which makes me think that the actual impact of AI will be a lot smoother than we might otherwise expect based solely on simple architectures being generally useful and scalable.
What makes you say BIG Bench is a joint Google / OpenAI project? I’m a contributor to it and have seen no evidence of that.
During the workshop presentation, Jascha said that OpenAI will run their models on the benchmark. This suggests that there is (was?) some collaboration. But that was half a year ago.
Just checked: the repo’s readme doesn’t mention OpenAI anymore. In earlier versions, it was mentioned like this:
“Teams at Google and OpenAI have committed to evaluate BIG-Bench on their best-performing model architectures.”
So, it seems that OpenAI withdrew from the project, partially or fully.
OpenAI is still running evaluations.
Interesting… I was busy and wasn’t able to watch the workshop. That’s good to know, thanks!
GPT-4 is expected to have about 10^14 parameters and be ready in a few years. And, we already know that GPT-3 can write code. The following all seem (to me at least) like very reasonable conjectures:
(i) Writing code is one of the tasks at which GPT-4 will have (at least) human level capability.
(ii) Clones of GPT-4 will be produced fairly rapidly after GPT-4, say 1-3 years.
(iii) GPT-4 and its clones will have a significant impact on society. This will show up in the real economy.
(iv) GPT-4 will be enough to shock governments into paying attention. (But as we have seen with climate change, governments can pay attention to an issue for a long time without effectively doing anything about it.)
(v) Someone is going to ask GPT-4 (or a clone) to produce code that generates AGI. (Implicitly, if not explicitly.)
I have absolutely no idea whether GPT-4 will succeed at this endeavor. But if not, GPT-5 should be available a few years later....
(And, of course, this is just one pathway.)
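For a sense of scale (back-of-the-envelope only; the 10^14 figure is the conjecture above, and GPT-3’s published size is about 1.75 × 10^11 parameters):

```python
gpt3_params = 1.75e11        # GPT-3's published parameter count
conjectured_gpt4 = 1e14      # the figure conjectured in the comment above
print(f"That would be roughly a {conjectured_gpt4 / gpt3_params:.0f}x jump.")  # ~571x
```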
There was a Q&A where Sam Altman said GPT-4 is going to be a lot smaller than that (in particular, that it wouldn’t have a lot more parameters than GPT-3).
You appear to be correct. I will withdraw my comment.