I think BIG-bench could be the final AI benchmark: if a language model surpasses the top human score on it, the model is an AGI.
Could you explain the reasoning behind this claim? Note that:
- PaLM already beats the “human (Avg.)” baseline on 150 tasks, and the scaling curve is not bending. (So is PaLM already an AGI?)
- It also looks like a scaled-up Chinchilla would beat PaLM.
- It’s plausible that PaLM and Chinchilla could be improved by further finetuning and prompt engineering.
- Most tasks in BIG-bench are multiple-choice, which is favorable to LMs (compared to generation).
- I’d guess that some tasks will leak into training data (despite the authors’ efforts to prevent this).
Source for PaLM: https://arxiv.org/abs/2204.02311
I agree, some future scaled-up versions of PaLM & Co. may indeed be able to surpass top humans on BIG-bench.
Could you explain the reasoning behind this claim? [“if a language model surpasses the top human score on it, the model is an AGI”]
Ultimately, it’s the question of how we define “AGI”. One reasonable definition is “an AI that can do any cognitive task that humans can, and do it better than humans”.
Given its massive scope and diversity, BIG-bench seems to be a good enough proxy for “any cognitive task”.
Although I would use a stricter scoring rule than the average-across-tasks used in the PaLM paper: the model must 1) beat top humans, 2) on each and every task of BIG-bench.
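To make the difference concrete, here’s a minimal sketch of the two scoring rules. The task names and scores are made up purely for illustration; BIG-bench doesn’t report results in this form.

```python
# Hypothetical per-task scores for a model and for human raters.
model_scores = {"task_a": 62.0, "task_b": 71.5, "task_c": 55.0}
human_avg_scores = {"task_a": 58.0, "task_b": 65.0, "task_c": 60.0}
human_top_scores = {"task_a": 95.0, "task_b": 98.0, "task_c": 92.0}

# PaLM-paper-style summary: count tasks where the model beats the average human.
beats_avg = sum(model_scores[t] > human_avg_scores[t] for t in model_scores)
print(f"beats human (Avg.) on {beats_avg}/{len(model_scores)} tasks")

# Stricter rule: the model must beat the *top* human on every single task.
clears_strict_bar = all(model_scores[t] > human_top_scores[t] for t in model_scores)
print(f"beats top humans on every task: {clears_strict_bar}")
```

Note how a model can look strong on the average-based summary (2/3 tasks here) while being nowhere near the strict bar.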
One could argue that simple models like PaLM don’t have agency, goals, persistence of thought, self-awareness etc., and thus can’t become the human-like AGI of science fiction. But it’s quite possible that such qualities are not necessary to do all cognitive tasks that humans can, only better.
A simple mechanistic algorithm can beat top humans in chess. Maybe another simple mechanistic algorithm can also beat top humans in science, poetry, AI engineering, strategic business management, childrearing, and in all other activities that make human intellectuals proud of themselves.
I’m curious why you think the correct standard is “beats the top human on all tasks” instead of “beats the average human on all tasks”. It is generally accepted that humans are general intelligences, and by definition the average human is average here. Why wouldn’t a computer program that can do better than the average human on all relevant tasks be an AGI?
I agree with the sentiment, but would like to be careful with interpreting the average human scores on AI benchmarks. Such scores are obtained under time constraints, and maybe not all human raters were sufficiently motivated to do their best. The scores of top humans are more likely to be representative of the general human ability to do the task.
Small remark: BIG-bench does include tasks on self-awareness, and I’d argue that it is a requirement for your definition “an AI that can do any cognitive task that humans can”, as well as being generally important for problem solving. Being able to correctly answer the question “Can I do task X?” is evidence of self-awareness and is clearly beneficial.
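As a toy illustration of that probe (`model_api` and the task objects here are hypothetical stand-ins, not BIG-bench’s actual evaluation harness):

```python
def self_prediction_accuracy(model_api, tasks):
    """Fraction of tasks where the model correctly predicts its own success."""
    correct = 0
    for task in tasks:
        # Ask the model to predict its own ability before it attempts the task.
        answer = model_api.ask(f"Can you solve the task '{task.name}'? Answer yes or no.")
        predicted_yes = answer.strip().lower().startswith("yes")

        # Then let it actually attempt the task (assumed to return True on success).
        solved = task.run(model_api)

        if predicted_yes == solved:
            correct += 1

    # A well-calibrated self-model predicts both its successes and its failures.
    return correct / len(tasks)
```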
I think you’re right on both points. Although I’m not sure if self-awareness is necessary to surpass humans at all cognitive tasks. I can imagine a descendant of GPT that completely fails the self-awareness benchmarks, yet is able to write the most beautiful poetry, conduct Nobel-level research in physics, and even design a superior version of itself.