Do you have any thoughts on a softer version of this problem, where the metric can’t be maximized directly, but gives a concrete idea of what sort of challenge your AI needs to beat to qualify as AGI? (And therefore in which direction in the architectural-design-space you should be moving.)
Some variation on this seems like it might work as a “fire alarm” test set, but as you point out, insofar as it’s widely recognized, it’ll end up misused as a benchmark to optimize against instead.
(I suppose the ideal way to do it would be to hand it off to e.g. ARC, so they can use it if OpenAI invites them for safety testing again. That way, SOTA models still get tested, but the actors who might misuse it aren’t aware of the testing’s particulars until they pass it anyway...)