What percentage of humans does it need to beat at a given set of benchmarks?
If I wanted to prove I have a t AGI, do I need to be at the 99th percentile or the 50th percentile for each benchmark, and out of all the benchmarks, how many of them do we need to “pass”? (“Pass” meaning “declare victory”, but the numerical score is the thing that matters.)
There are also major issues with benchmarking, because current AI systems are stupidly good at learning to pass any test, so leaked questions cause a problem.
In a way, what you are asking for is an AGI architecture: the architecture would get trained on data captured before the test questions were developed, and you’re measuring its ability to use all of that training data and its cognitive architecture on these benchmark tasks.
Certain questions like “build a successful company”, or other complex real-world tasks, have the problem that each successful company was only possible because it was founded for the right target (product, market, time). Miss any of those and it will fail, even if the AGI does a better job than human entrepreneurs.
I think this specifies both thresholds to be 50%.
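To make the two thresholds concrete, here is a rough sketch of how such a criterion could be checked. Everything in it is an assumption for illustration: the function names, the benchmark names, and the scores are made up, and "percentile" is taken to mean the share of human scores strictly below the AI's score.

```python
from bisect import bisect_left


def percentile_rank(score, human_scores):
    """Percentage of human scores strictly below `score` (assumed definition)."""
    ordered = sorted(human_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


def passes_suite(ai_scores, human_scores,
                 percentile_threshold=50.0, pass_fraction=0.5):
    """Check a two-threshold criterion.

    Threshold 1: on each benchmark, the AI must reach `percentile_threshold`
    relative to the human scores for that benchmark.
    Threshold 2: the share of benchmarks passed must reach `pass_fraction`.
    """
    passed = sum(
        percentile_rank(ai_scores[name], human_scores[name]) >= percentile_threshold
        for name in ai_scores
    )
    return passed / len(ai_scores) >= pass_fraction


# Made-up numbers, purely illustrative.
human = {
    "math": [10, 20, 30, 40],
    "code": [5, 15, 25, 35],
    "writing": [50, 60, 70, 80],
}
ai = {"math": 35, "code": 10, "writing": 75}

print(passes_suite(ai, human))                             # both thresholds at 50%
print(passes_suite(ai, human, percentile_threshold=99.0))  # stricter per-benchmark bar
```

Moving either knob changes the verdict on the same scores, which is why the two numbers need to be pinned down before anyone can "declare victory".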