The preliminary results were obtained on a subset of the full benchmark (~90 tasks vs 206 tasks), and there have been many changes since then, including scoring changes. Thus, I’m not sure we’ll see the same dynamics in the final results. Most likely yes, but maybe not.
I agree that the task selection process could create dynamics that look like acceleration. A good point.
As I understand it, the organizers accepted almost all submitted tasks (the main rejection reasons were technical, such as copyright). So it was mostly self-selection, with a bias towards the hardest imaginable text tasks. It seems that for many contributors, the main motivation was something like:
Take that, the most advanced AI of Google! Let’s see if you can handle my epic task!
This includes many cognitive tasks that are supposedly human-complete (e.g. understanding humor, irony, ethics), and tasks that probe the model’s generality (e.g. playing chess, recognizing images, navigating mazes, all in text).
I wonder if the performance dynamics on such tasks will follow the same curve.
The list of all tasks is available here.