The results were presented at a workshop by the project organizers. The video from the workshop is available here (the most relevant presentation starts at 5:05:00).
It’s one of those innocent presentations that, after you understand the implications, keep you awake at night.
Presumably you’re referring to this graph. The y-axis looks like the kind of score that ranges between 0 and 1, in which case this looks sort of like a sigmoid to me, which accelerates as it approaches ~50% performance (and decelerates as it approaches 100% performance).
If so, we might want to ask whether these tasks were chosen ~randomly (among tasks that are indicative of how useful AI is) or whether they were selected for difficulty in some way. In particular, assume that most tasks look sort of like a sigmoid as models are scaled up (accelerating around 50%, improving more slowly when they’re close to 0% or 100%). Then you might think that the most exciting tasks to submit to BIG-bench would be the ones that can’t be handled by small models but that large models rapidly improve upon (as opposed to tasks that are basically solved already at 10^10 parameters). In that case, the aggregate over all these tasks could be expected to look sort of like this, improving faster after 10^10 than before (a toy simulation of this selection effect is sketched below).
...is one story I can tell, but I don’t know if I would have predicted that beforehand, and fast acceleration after 10^10 is certainly consistent with many people’s qualitative impressions of GPT-3. So maybe there is some real acceleration going on.
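To make that selection story concrete, here is a minimal toy simulation (purely hypothetical sigmoids and parameters chosen for illustration; not the actual BIG-bench tasks, data, or scoring). Each task’s score is modeled as a sigmoid in log10(parameter count), and the only bias is that contributors submit tasks whose midpoint lies beyond 10^10 parameters:

```python
import numpy as np

def task_score(log_params, midpoint, slope=1.5):
    # Score in [0, 1]: accelerates near the task's midpoint, saturates near 0 and 1.
    return 1.0 / (1.0 + np.exp(-slope * (log_params - midpoint)))

rng = np.random.default_rng(0)
log_params = np.linspace(7, 12, 6)  # models from 10^7 to 10^12 parameters

# Candidate tasks: sigmoid midpoints spread over a wide range of scales.
midpoints = rng.uniform(7, 13, size=1000)

# Self-selection: only tasks that ~10^10-parameter models can't yet handle get submitted.
submitted = midpoints[midpoints > 10]

agg_all = np.mean([task_score(log_params, m) for m in midpoints], axis=0)
agg_sub = np.mean([task_score(log_params, m) for m in submitted], axis=0)

for lp, a, s in zip(log_params, agg_all, agg_sub):
    print(f"10^{lp:.0f} params: all tasks {a:.2f}, submitted tasks {s:.2f}")
```

Under these toy assumptions, the submitted-tasks average sits near zero until ~10^10 and then its per-decade gains accelerate, while the average over all candidate tasks improves much more smoothly, even though no individual task does anything special at 10^10.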
(Also, see this post for similar curves, but for the benchmarks that OpenAI tested GPT-3 on. There’s no real acceleration visible there, other than for arithmetic.)
The preliminary results were obtained on a subset of the full benchmark (~90 of the 206 tasks). And there have been many changes since then, including scoring changes. Thus, I’m not sure we’ll see the same dynamics in the final results. Most likely yes, but maybe not.
I agree that the task selection process could create dynamics that look like acceleration. A good point.
As I understand it, the organizers accepted almost all submitted tasks (the main rejection reasons were technical, e.g. copyright). So it was mostly self-selection, with a bias towards the hardest imaginable text tasks. It seems that for many contributors, the main motivation was something like:
“Take that, the most advanced AI of Google! Let’s see if you can handle my epic task!”
This includes many cognitive tasks that are supposedly human-complete (e.g. understanding humor, irony, and ethics), and tasks that probe the model’s generality (e.g. playing chess, recognizing images, navigating mazes, all in text).
I wonder if the performance dynamics on such tasks will follow the same curve.
The list of all tasks is available here.