Seems like an easy way to create a less-fakeable benchmark would be to evaluate the LLM+scaffolding on multiple different games? Optimizing for beating Pokemon Red alone would of course be a cheap PR win, so people will try to do it. But optimizing for beating a wide variety of games would be a much bigger win, since it would probably require the AI to develop genuinely valuable agentic capabilities.
It will probably be correct to chide people who update on the cheap PR win. But perhaps the bigger win, which would actually justify such updates, might come soon afterwards!