Correct. See a more complete list of scaffold features here.
Julian Bradshaw
This is kinda-sorta being done at the moment; after Gemini beat the game, the stream has just kept on going. Currently Gemini is lost in Mt. Moon, as is tradition. In fact, its earlier exploration of Mt. Moon seems to be hampering it (no unexplored areas on the minimap to lure it in the right direction).
I believe the dev is planning to do a fresh run soon-ish once they’ve stabilized their scaffold.
Yeah it’s not open source or published anywhere unfortunately.
Gemini 2.5 Pro just beat Pokémon Blue. (https://x.com/sundarpichai/status/1918455766542930004)
A few things ended up being key to the successful run:
Map labeling—very detailed labeling of individual map tiles, including tiles that move you to a new location (“warps” like doorways, ladders, and cave entrances) and puzzle entities
Separate instances of Gemini with different, narrower prompts—these were used by the main Gemini playing the game to reason about certain tasks (ex. navigation, boulder puzzles, critique of current plans)
Detailed prompting—a lot of iteration on this (up to the point of ex. “if you’re navigating a long distance that crosses water midway through, make sure to use surf”)
For these and other reasons, it was not a “clean” win in a certain sense (nor a short one—it took over 100,000 thinking actions), but the victory is still a notable accomplishment. What’s next is LLMs beating Pokémon with less handholding and less difficulty.
Yeah by “robust” I meant “can programmatically interact with game”.
There’s at least workable tools for Pokémon FireRed (the 2004 re-release of the 1996 original) it turns out, and you can find a scaffold using that here.
Open Source LLM Pokémon Scaffold
Yeah it is confusing. You’d think there’s tons of available data on pixelated game screens. Maybe training on it somehow degrades performance on other images?
I’ll let you know. They’re working on open-sourcing their scaffold at the moment.
Actually another group released VideoGameBench just a few days ago, which includes Pokémon Red among other games. Just a basic scaffold for Red, but that’s fair.
As I wrote in my other post:
Why hasn’t anyone run this as a rigorous benchmark? Probably because it takes multiple weeks to run a single attempt, and moreover a lot of progress comes down to effectively “model RNG”—ex. Gemini just recently failed Safari Zone, a difficult challenge, because its inventory happened to be full and it couldn’t accept an item it needed. And ex. Claude has taken wildly different amounts of time to exit Mt. Moon across attempts depending on how he happens to wander. To really run the benchmark rigorously, you’d need a sample of at least 10 full playthroughs, which would take perhaps a full year, at which point there’d be new models.
I think VideoGameBench has the right approach, which is to give only a basic scaffold (less than described in this post), and when LLMs can make quick, cheap progress through Pokemon Red (not taking weeks and tens of thousands of steps) using that, we’ll know real progress has been made.
Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red
Re: biosignatures detected on K2-18b, there have been a couple of popular takes saying this solves the Fermi Paradox: K2-18b is so big (8.6x Earth mass) that you can’t get to orbit, and maybe most life-bearing planets are like that.
This is wrong for several reasons:
You can still get to orbit there, it’s just much harder (only 1.3g b/c of larger radius!) (https://x.com/CheerupR/status/1913991596753797383)
It’s much easier for us to detect large planets than small ones (https://exoplanets.nasa.gov/alien-worlds/ways-to-find-a-planet), but we expect small ones to be common too (once detected you can then do atmospheric spectroscopy via JWST to find biosignatures)
Assuming K2-18b does have life actually makes the Fermi paradox worse, because it strongly implies single-celled life is common in the galaxy, removing a potential Great Filter
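The ~1.3g figure in the first point is easy to sanity-check: surface gravity scales as M/R² in Earth units, and K2-18b's large radius mostly cancels its large mass. The 8.6 Earth masses is from above; the ~2.6 Earth radii is the commonly reported value for K2-18b, used here as an assumption:

```python
# Sanity check of the ~1.3g surface gravity claim for K2-18b.
# Surface gravity scales as g ∝ M / R^2 (working in Earth units).
# Mass ~8.6 M_earth is from the comment; radius ~2.6 R_earth is the
# commonly reported estimate, taken here as an assumption.

mass_earths = 8.6
radius_earths = 2.6

surface_gravity_g = mass_earths / radius_earths**2
print(round(surface_gravity_g, 2))  # ≈ 1.27
```

So despite 8.6x the mass, you only fight ~1.3x the surface gravity (reaching orbit is still harder than that suggests, since orbital velocity scales differently, but it's nowhere near impossible).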
I would say “agent harness” is a type of “scaffolding”. I used it in this case because it’s how Logan Kilpatrick described it in the tweet I linked at the beginning of the post.
I’m not sure that TAS counts as “AI” since they’re usually compiled by humans, but the “PokeBotBad” you linked is interesting—hadn’t heard of that before. It’s an Any% Glitchless speedrun bot that ran until ~2017 and managed a solid 1:48:27 time on 2/25/17, better than the human world record until 2/12/18. Still, I’d say this is more a programmed “bot” than an AI in the sense we care about.
Anyway, you’re right that the whole reason the Pokémon benchmark exists is because it’s interesting to see how well an untrained LLM can do playing it.
Is Gemini now better than Claude at Pokémon?
since there’s no obvious reason why they’d be biased in a particular direction
No I’m saying there are obvious reasons why we’d be biased towards truthtelling. I mentioned “spread truth about AI risk” earlier, but also more generally one of our main goals is to get our map to match the territory as a collaborative community project. Lying makes that harder.
Besides sabotaging the community’s map, lying is dangerous to your own map too. As OP notes, to really lie effectively, you have to believe the lie. Well is it said, “If you once tell a lie, the truth is ever after your enemy.”
But to answer your question, it’s not wrong to do consequentialist analysis of lying. Again, I’m not Kantian—tell the guy who’s here to murder you whatever lie you need to survive. But I think there are a lot of long-term consequences in less thought-experimenty cases that’d be tough to measure.
I’m not convinced SBF had conflicting goals, although it’s hard to know. But more importantly, I don’t agree rationalists “tend not to lie enough”. I’m no Kantian, to be clear, but I believe rationalists ought to aspire to a higher standard of truthtelling than the average person, even if there are some downsides to that.
Have we forgotten Sam Bankman-Fried already? Let’s not renounce virtues in the name of expected value so lightly.
Rationalism was founded partly to disseminate the truth about AI risk. It is hard to spread the truth when you are a known liar, especially when the truth is already difficult to believe.
Huh, seems you are correct. They also apparently are heavily cannibalistic, which might be a good impetus for modeling the intentions of other members of your species…
Oh okay. I agree it’s possible there’s no Great Filter.
Unless a dentist has told you to do this for some reason, you should know this is not recommended. Brushing hard can damage tooth enamel and cause gum recession (i.e. your gums shrink down, which causes lots of problems).