Yeah, tough challenge. A lot of things that immediately come to mind for settling questions about AGI are complicated real-world tasks that are hard to convert into benchmarks (e.g. driving cars in complicated situations that require interpreting text on novel road signs). This is especially true for not-quite-general AI, which might do well on some subset of the necessary steps but not on enough of them to actually complete the full challenge.
So I guess what I want is granularity, where you break a task like “read and correctly react to a novel road sign while driving a car” into a series of steps, and test how good AIs are at those steps. But this is a lot of work, both for benchmark design and for AI developers who have to figure out how to get their AI to interface with all these different things it might be asked to do.
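To make the granularity idea concrete, here is a minimal sketch of what a per-step scoring harness might look like. The step decomposition, step names, and placeholder scores are all invented for illustration; a real benchmark would replace each lambda with an actual test suite run against the model.

```python
def score_granular(model, steps, threshold=0.9):
    """Score a model on each sub-step of a decomposed task.

    Returns a dict of per-step scores (0.0-1.0) and whether every
    step clears the threshold, so partial competence is visible
    instead of just a single end-to-end pass/fail.
    """
    scores = {name: test(model) for name, test in steps.items()}
    passed_all = all(s >= threshold for s in scores.values())
    return scores, passed_all


# Hypothetical decomposition of "read and react to a novel road sign".
# The lambdas stand in for real per-step test harnesses.
dummy_model = object()
steps = {
    "detect_sign": lambda m: 0.95,
    "read_text": lambda m: 0.80,
    "infer_required_action": lambda m: 0.60,
    "execute_maneuver": lambda m: 0.40,
}

scores, passed = score_granular(dummy_model, steps)
```

The point of the structure is that a not-quite-general AI shows up as a profile of per-step scores rather than a single failure, which is exactly the information a coarse end-to-end benchmark throws away.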