Correct. See a more complete list of scaffold features here.
Julian Bradshaw
This is kinda-sorta being done at the moment; after Gemini beat the game, the stream has just kept on going. Currently Gemini is lost in Mt. Moon, as is tradition. In fact, its earlier exploration of Mt. Moon seems to be hampering it (no unexplored areas on the minimap to lure it in the right direction).
I believe the dev is planning to do a fresh run soon-ish once they’ve stabilized their scaffold.
Yeah it’s not open source or published anywhere unfortunately.
Gemini 2.5 Pro just beat Pokémon Blue. (https://x.com/sundarpichai/status/1918455766542930004)
A few things ended up being key to the successful run:
Map labeling—very detailed labeling of individual map tiles, including tiles that move you to a new location (“warps” like doorways, ladders, and cave entrances) and puzzle entities
Separate instances of Gemini with different, narrower prompts—these were used by the main Gemini playing the game to reason about certain tasks (ex. navigation, boulder puzzles, critique of current plans)
Detailed prompting—a lot of iteration on this (up to the point of ex. “if you’re navigating a long distance that crosses water midway through, make sure to use surf”)
For these and other reasons, it was not a “clean” win in a certain sense (nor a short one—it took over 100,000 thinking actions), but the victory is still a notable accomplishment. What’s next is LLMs beating Pokémon with less handholding and less difficulty.
Yeah by “robust” I meant “can programmatically interact with game”.
There’s at least workable tools for Pokémon FireRed (the 2004 re-release of the 1996 original) it turns out, and you can find a scaffold using that here.
Open Source LLM Pokémon Scaffold
Yeah it is confusing. You’d think there’s tons of available data on pixelated game screens. Maybe training on it somehow degrades performance on other images?
I’ll let you know. They’re working on open-sourcing their scaffold at the moment.
Actually another group released VideoGameBench just a few days ago, which includes Pokémon Red among other games. Just a basic scaffold for Red, but that’s fair.
As I wrote in my other post:
Why hasn’t anyone run this as a rigorous benchmark? Probably because it takes multiple weeks to run a single attempt, and moreover a lot of progress comes down to effectively “model RNG”—ex. Gemini just recently failed Safari Zone, a difficult challenge, because its inventory happened to be full and it couldn’t accept an item it needed. And ex. Claude has taken wildly different amounts of time to exit Mt. Moon across attempts depending on how he happens to wander. To really run the benchmark rigorously, you’d need a sample of at least 10 full playthroughs, which would take perhaps a full year, at which point there’d be new models.
I think VideoGameBench has the right approach, which is to give only a basic scaffold (less than described in this post), and when LLMs can make quick, cheap progress through Pokemon Red (not taking weeks and tens of thousands of steps) using that, we’ll know real progress has been made.
Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red
Re: biosignatures detected on K2-18b, there have been a couple of popular takes saying this solves the Fermi Paradox: K2-18b is so big (8.6x Earth mass) that you can’t get to orbit, and maybe most life-bearing planets are like that.
This is wrong for several reasons:
You can still get to orbit there, it’s just much harder (only 1.3g b/c of larger radius!) (https://x.com/CheerupR/status/1913991596753797383)
It’s much easier for us to detect large planets than small ones (https://exoplanets.nasa.gov/alien-worlds/ways-to-find-a-planet), but we expect small ones to be common too (once detected you can then do atmospheric spectroscopy via JWST to find biosignatures)
Assuming K2-18b does have life actually makes the Fermi paradox worse, because it strongly implies single-celled life is common in the galaxy, removing a potential Great Filter
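The ~1.3g figure in the first point is easy to sanity-check: surface gravity scales as M/R² in Earth units, and K2-18b's large radius mostly cancels its large mass. The 8.6 Earth masses is from above; the ~2.6 Earth radii is the commonly reported value for K2-18b, used here as an assumption:

```python
# Sanity check of the ~1.3g surface gravity claim for K2-18b.
# Surface gravity scales as g ∝ M / R^2 (working in Earth units).
# Mass ~8.6 M_earth is from the comment; radius ~2.6 R_earth is the
# commonly reported estimate, taken here as an assumption.

mass_earths = 8.6
radius_earths = 2.6

surface_gravity_g = mass_earths / radius_earths**2
print(round(surface_gravity_g, 2))  # ≈ 1.27
```

So despite 8.6x the mass, you only fight ~1.3x the surface gravity (reaching orbit is still harder than that suggests, since orbital velocity scales differently, but it's nowhere near impossible).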
I would say “agent harness” is a type of “scaffolding”. I used it in this case because it’s how Logan Kilpatrick described it in the tweet I linked at the beginning of the post.
I’m not sure that TAS counts as “AI” since they’re usually compiled by humans, but the “PokeBotBad” you linked is interesting—hadn’t heard of that before. It’s an Any% Glitchless speedrun bot that ran until ~2017 and managed a solid 1:48:27 time on 2/25/17, better than the human world record until 2/12/18. Still, I’d say this is more a programmed “bot” than an AI in the sense we care about.
Anyway, you’re right that the whole reason the Pokémon benchmark exists is because it’s interesting to see how well an untrained LLM can do playing it.
Is Gemini now better than Claude at Pokémon?
since there’s no obvious reason why they’d be biased in a particular direction
No I’m saying there are obvious reasons why we’d be biased towards truthtelling. I mentioned “spread truth about AI risk” earlier, but also more generally one of our main goals is to get our map to match the territory as a collaborative community project. Lying makes that harder.
Besides sabotaging the community’s map, lying is dangerous to your own map too. As OP notes, to really lie effectively, you have to believe the lie. Well is it said, “If you once tell a lie, the truth is ever after your enemy.”
But to answer your question, it’s not wrong to do consequentialist analysis of lying. Again, I’m not Kantian—tell the guy who’s here to murder you whatever lie you need to survive. But I think there are a lot of long-term consequences in less thought-experimenty cases that’d be tough to measure.
I’m not convinced SBF had conflicting goals, although it’s hard to know. But more importantly, I don’t agree rationalists “tend not to lie enough”. I’m no Kantian, to be clear, but I believe rationalists ought to aspire to a higher standard of truthtelling than the average person, even if there are some downsides to that.
Have we forgotten Sam Bankman-Fried already? Let’s not renounce virtues in the name of expected value so lightly.
Rationalism was founded partly to disseminate the truth about AI risk. It is hard to spread the truth when you are a known liar, especially when the truth is already difficult to believe.
Huh, seems you are correct. They also apparently are heavily cannibalistic, which might be a good impetus for modeling the intentions of other members of your species…
Oh okay. I agree it’s possible there’s no Great Filter.
Unless a dentist has told you to do this for some reason, you should know this is not recommended. Brushing hard can damage tooth enamel and cause gum recession (i.e. your gums shrink down, which causes lots of problems).