Metaculus is strongly considering organizing an annual AGI benchmarking event. Once a year, we’d run a benchmark or suite of benchmarks against the most generally intelligent AI systems available to us at the time, seeking to assess their generality and the overall shape of their capabilities. We would publicize the event widely among the AI research, policy, and forecasting communities.
Why?
We think this might be a good idea for several reasons:
The event could provide a convening ground for the AI research community, helping it to arrive at a shared understanding of the current state of AGI research, and acting as a focal point for rational discussion on the future of AGI.
An annual benchmarking event has advantages over static, run-any-time benchmarks when it comes to testing generality. Unless one constrains the training data and restricts the hard-coded knowledge used by systems under evaluation, developers may directly optimize for a static benchmark while building their systems, which makes static benchmarks less useful as measures of generality. With the annual format, we are free to change the tasks every year without informing developers of what they will be beforehand, thereby assessing what François Chollet terms developer-aware generalization.
Frequent feedback improves performance in almost any domain; this event could provide a target for AGI forecasting that yields yearly feedback, allowing us to iterate on our approaches and hone our understanding of how to forecast the development of AGI.
The capabilities of an AGI will not be completely boundless, so it’s interesting to ask what its strengths and limitations are likely to be. If designed properly, our benchmarks could give us clues as to what the “shape” of AGI capabilities may turn out to be.
How?
We’re currently working on a plan, and are soliciting ideas and feedback from the community here. To guide the discussion, here are some properties we think the ideal benchmark should have. It would:
Engage a broad, diverse set of AI researchers, and act as a focal point for rational, forecasting-based discussion on the future of AGI.
Measure the generality and adaptability of intelligent systems, not just their performance on a fixed, known-beforehand set of tasks.
Form the basis for AGI forecasting questions with a one-year lifetime.
Generate predictive signal as to the types of capabilities that an AGI system is likely to possess.
Provide a quantitative measure of generality, ranking more general systems above less general ones, rather than giving a binary “general or not” outcome.
Be sensitive to differences in generality even among the weakly general systems available today.
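To make the last two properties concrete, here is a minimal sketch of one way a graded generality score could be computed. It is purely illustrative and is not a method proposed in this post or in the papers listed below: the function names, the task names, the assumption that every task yields a score between 0 and 1, and the choice of a floored geometric mean are all hypothetical. The geometric mean is used only because it penalizes a system that fails entire tasks more than an arithmetic mean would, while still separating weak but broad systems from one another.

```python
import math


def generality_score(task_scores, floor=0.01):
    """Floored geometric mean of per-task scores (hypothetical metric).

    task_scores maps a task name to a score assumed to lie in [0, 1].
    The floor keeps a single zero from collapsing the whole score to 0,
    so weak systems can still be distinguished from one another.
    """
    if not task_scores:
        raise ValueError("at least one task score is required")
    clipped = [min(1.0, max(floor, s)) for s in task_scores.values()]
    return math.exp(sum(math.log(s) for s in clipped) / len(clipped))


def rank_systems(results):
    """Return (system name, score) pairs, most general first."""
    scored = [(name, generality_score(scores)) for name, scores in results.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up scores for two imaginary systems on four imaginary tasks.
    results = {
        "narrow_expert": {"chess": 1.0, "driving": 0.0, "reading": 0.0, "coding": 0.0},
        "weak_generalist": {"chess": 0.3, "driving": 0.3, "reading": 0.3, "coding": 0.3},
    }
    for name, score in rank_systems(results):
        print(f"{name}: {score:.3f}")
```

Under these assumptions the weak generalist ranks above the narrow expert even though the expert has the higher peak score, which is the kind of ordering a generality measure, as opposed to a performance measure, would be expected to produce.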
Once we’ve collected the strongest ideas and developed them into a cohesive whole, we will solicit feedback from the AI research community before publishing the final plan. Thanks for your contributions to the discussion – we look forward to reading and engaging with your ideas!
Background reading
Here are a few resources to get you thinking.
Threads:
An idea based on iteratively crowdsourcing adversarial questions
A discussion on AGI benchmarking
Papers:
On the Measure of Intelligence
What we might look for in an AGI benchmark
General intelligence disentangled via a generality metric for natural and artificial intelligence
Yeah, tough challenge. A lot of things that immediately come to mind for settling questions about AGI are complicated real-world tasks that are hard to convert into benchmarks (e.g. driving cars in complicated situations that require interpreting text on novel road signs). This is especially true for not-quite-general AI, which might do well on some subset of the necessary tasks but not on enough of them to actually complete the full challenge.
So I guess what I want is granularity, where you break a task like “read and correctly react to a novel road sign while driving a car” into a series of steps, and test how good AIs are at those steps. But this is a lot of work, both for benchmark design and for AI developers who have to figure out how to get their AI to interface with all these different things it might be asked to do.
There is plausibly one big purpose it serves: acting as a legible fire alarm for AGI, so that we actually know when to respond. Eliezer Yudkowsky has argued that one of the biggest challenges for AGI alignment is that there is no fire alarm for AGI, and this event could be a way to create one.
Very cool. Something I’ve been wondering about recently is how to keep benchmarks from becoming available to developers and being optimized against. I had an idea for a benchmark and thought about submitting it to BigBench, but decided against it because I was worried that BigBench’s precautions against being optimized against were insufficient. It seems reasonable for a company to keep its model private and expose just a filtered API, but what about evaluators who want to keep their benchmarks private? Maybe there could be a special arrangement where the companies agree not to record the benchmark queries as they pass through their APIs to their models?