The inclusion criteria state: "Tasks that are completely beyond the capabilities of current language models are also encouraged."
It's easy to come up with a benchmark that requires a high but unspecified level of intelligence. An extreme example would be to ask for a proof that P ≠ NP: we have no idea how difficult that task is, though we suspect it requires superintelligence. To be valuable, the challenge a benchmark poses must be possible to relate to meaningful capabilities, such as "The Human Level".
Most people couldn’t answer questions about cryobiology in Spanish, even though they possess general intelligence. This benchmark seems to consist of random tasks around and above the human level, and I fear progress on this benchmark might be poorly correlated with progress towards AGI.
You're right. And some of the existing tasks in the benchmark are way beyond the abilities of baseline humans (e.g. the image-classification task where the images are given as hex-encoded text).
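To make that concrete, here is a minimal sketch of what such a task input could look like, assuming the image is presented as a hex dump of its raw file bytes; the exact BIG-bench task format may differ, and image_to_hex_prompt is a hypothetical helper for illustration only:

```python
def image_to_hex_prompt(image_path: str) -> str:
    """Read an image file's raw bytes and render them as a single hex string."""
    # Illustrative sketch only; not the actual BIG-bench task code.
    with open(image_path, "rb") as f:
        raw = f.read()
    return raw.hex()

# Even a small 32x32 PNG becomes thousands of hex characters, e.g.
# "89504e470d0a1a0a0000000d49484452000000200000002008..." (PNG header first).
# Classifying the depicted object from this text alone is far beyond
# what an unaided human can do.
```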
On the other hand, the organizers allowed the human testers to use any tools they wanted, including internet search, software, etc. So the measured top-human performance is the performance of humans augmented with technology.
I think an AI that can solve BIG-bench must be an AGI. But there could be an AGI that can’t solve BIG-bench yet.