This is convincing evidence that LLMs are far from AGI.
Eventually, one of the labs will solve it, a bunch of people will publicly update, and I’ll point out that the entire conversation about how an LLM should beat Pokémon was in the training data, that the scaffolding was carefully set up to keep it on rails in this specific game, that the available action set amounts to feature selection, and so on.
I disagree, because to me this just looks like LLMs are one algorithmic improvement away from having executive function, similar to how they couldn’t do system-2-style reasoning until this year, when RL on math problems started working.
For example, consider being unable to change its goals on the fly. If a kid kept trying to push forward when his Pokémon were too weak, he would keep losing, get upset, and hopefully, in a moment of mental clarity, learn the general principle that he should step back and reconsider his goals every so often. I think most children learn some form of this from playing around as toddlers, and reconsidering goals is something we keep improving at as adults.
Unlike us, I don’t think Claude has training data for executive functions like these, but I wouldn’t be surprised if some smart ML researchers solved this within a year.
They might solve it in a year, with one stunning conceptual insight. They might solve it in ten years or more. There’s no decisive evidence either way; by default, I expect the trend of punctuated equilibria in AI research to continue for some time.
It seems like an easy way to create a less-fakeable benchmark would be to evaluate the LLM plus scaffolding on multiple different games. Optimizing for beating Pokémon Red alone would of course be a cheap PR win, so people will try to do it. But optimizing for beating a wide variety of games would be a much bigger win, since it would probably require the AI to develop more genuinely valuable agentic capabilities.
It will probably be correct to chide people who update on the cheap PR win. But the bigger win, which would actually justify such updates, might come soon afterwards!