I think humans doing METR’s tasks are more like “expert-level” rather than average/”human-level”. But current LLM agents are also far below human performance on tasks that don’t require any special expertise.
From GAIA:

GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. [Note: The latest highest AI agent score is now 39%.] This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA’s philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system’s capability to exhibit similar robustness as the average human does on such questions.
And LLMs and VLLMs seriously underperform humans on VisualWebArena, which tests for simple web-browsing capabilities.
I don’t know if being able to autonomously make money should be a necessary condition to qualify as AGI. But I would feel uncomfortable calling a system AGI if it can’t match human performance at simple agent tasks.
I think METR is aiming for expert-level tasks, but their current task set is closer in difficulty to GAIA and VisualWebArena than to what I would consider human expert-level difficulty. It’s tricky to decide, though, since LLMs circa 2024 seem really good at some stuff that is quite hard for humans, and bad at some stuff that is easy for humans. If the stuff they are currently bad at gets brought up to human level, without a decrease in skill at the stuff LLMs are already above-human at, the result would be a system well into the superhuman range. So where we draw the line for human level necessarily involves a tricky value-weighting problem across the various skills involved.