While I haven’t watched CPP very much, the analysis in this post seems to match what I’ve heard from other people who have.
That said, I think claims like
“So, how’s it doing? Well, pretty badly. Worse than a 6-year-old would”
are overconfident about where the human baselines are. Moreover, I think these sorts of claims reflect a general blind spot about how humans can get stuck on trivial obstacles in the same way AIs do.
A personal anecdote: when I was a kid (maybe 3rd or 4th grade, so 8 or 9 years old) I played Pokemon Red and couldn’t figure out how to get out of the first room, same as the Claude 3.0 Sonnet performance! Why? Well, is it obvious to you where the exit to this room is?
Answer: you have to stand on the carpet and press down.
Apparently this was a common issue! See this reddit thread for discussion of people who hit the same snag as me. In fact, it was a big enough issue that it was addressed in the FireRed remake, which made the rug stick out a bit.
I don’t think this is an isolated issue with the first room. Rather, I think that as railroaded as Pokemon might seem, there’s actually a bunch of things that it’s easy to get crucially confused about, resulting in getting totally stuck for a dumb reason until someone helps you out.
Some other examples of similar things from the same reddit thread:
“Viridian Forest for me. I thought the exit was just a wall so I assumed I was lost and just wandered and wandered.”
“When I got Blue, I traveled all the way to Mt. Moon, and all of my party fainted, right? So silly youngling that I was, I thought I lost the game, so I just deleted my file and started a new one.”
“In Sapphire/Ruby there was a bridge/bike path that you had to walk under. Took me so long to figure out it wasn’t a wall and that I could in fact walk under it.”
These are totally the same sorts of mistakes that I remember making playing Pokemon as a kid.
Further, have you ever gotten an adult who doesn’t normally play video games to try playing one? They have a tendency to get totally stuck in tutorial levels because game developers rely on certain “video game motifs” for load-bearing forms of communication; see e.g. this video.
I don’t think this is specific to video games: In most things I try to do, I run up against stupid, fake walls where there’s something obvious that I just “don’t get.” Fortunately, I’m able to do things like ask someone for a fresh pair of eyes or search the internet. Without this ability, I think I would have to abandon basically all of the core things I work on. When I need to help out people with worse “executive function”/“problem solving ability” than me (like relatives who need basic tech help), usually the main thing I do to get them unstuck is “google their problem.”
(As a more narrow point, I’m extremely dubious that howlongtobeat’s 26-hour number should be interpreted as representing the time that it would take an average human to beat Pokemon Red, even assuming that the humans are adults and that we entirely discard failed playthroughs.)
Further, have you ever gotten an adult who doesn’t normally play video games to try playing one? They have a tendency to get totally stuck in tutorial levels because game developers rely on certain “video game motifs” for load-bearing forms of communication; see e.g. this video.
So much +1 on this.
Also, I’ve played a ton of games, and in the last few years started helping a bit with playtesting them etc. And I found it striking how games aren’t inherently intuitive, but are rather made so via strong economic incentives, endless playtests to stop players from getting stuck, etc. Games are intuitive for humans because humans spend a ton of effort to make them that way. If AIs were the primary target audience, games would be made intuitive for them.
And as a separate note, I’m not sure what the appropriate human reference class for game-playing AIs is, but I challenge the assumption that it should be people who are familiar with games. Rather than, say, people picked at random from anywhere on earth.
And as a separate note, I’m not sure what the appropriate human reference class for game-playing AIs is, but I challenge the assumption that it should be people who are familiar with games. Rather than, say, people picked at random from anywhere on earth.
Should maybe restrict it to someone who has read all the documentation and discussion for the game that exists on the internet.
And as a separate note, I’m not sure what the appropriate human reference class for game-playing AIs is, but I challenge the assumption that it should be people who are familiar with games. Rather than, say, people picked at random from anywhere on earth.
If you did that for programming, AIs would already be considered strongly superhuman. Just like we compare AIs’ coding knowledge to programmers’, I think it’s perfectly fair to compare their gaming abilities to those of people who play video games.
Yeah but we train AIs on coding before we make that comparison. And we know that if you train an AI on a videogame it can often get superhuman performance. Here we’re trying to look at pure transfer learning, so I think it would be pretty fair to compare to someone who is generally competent but has never played videogames. Another interesting question is to what extent you can train an AI system on a variety of videogames and then have it take on a new one with no game-specific training. I don’t know if anyone has tried that with LLMs yet.
I am not 100% convinced by the comparison, because technically LLMs are only “reading” a bunch of source code; they are never given access to a compiler/interpreter. IMO actually running the code one has written is a very important part of learning, and I think it would be a much more difficult task for a human to learn to code just by reading a bunch of books/code, but never actually trying to write & run their own code.[1]
Also, in the video linked earlier in the thread, the girlfriend playing Terraria is deliberately not given access to the wiki, and thus I believe it is an unfair comparison. I expect to see much better human performance if you give them access to manuals & wikis about the game.
Another interesting question is to what extent you can train an AI system on a variety of videogames and then have it take on a new one with no game-specific training. I don’t know if anyone has tried that with LLMs yet.
Not sure either, but I agree that this would be an interesting experiment. (Human gamers are often much quicker at picking up new games and are much better at them than someone with no gaming background.)
I would expect the average human to stay very bad at coding, no matter how many books & code examples you give them. I would also expect some smaller class of humans to nevertheless be able to pull that feat off. (E.g. maybe a mathematician well versed in formal logic, who is used to doing complex symbolic manipulation correctly “only on paper”, could probably write non-trivial correct programs just by reading about the subject. In fact, a lot of stuff from computer science was worked out well before computers were built, e.g. Ada Lovelace is usually credited with writing the “first computer program”, well before the first digital computer existed.)
I kind of see your point about having all the game wikis, but I think I disagree about learning to code being necessarily interactive. Think about what feedback the compiler provides you: it tells you if you made a mistake, and sometimes what the mistake was. In cases where it runs but doesn’t do what you wanted, it might “show” you what the mistake was instead. You can learn programming just fine by reading and writing code but never running it, if you also have somebody knowledgeable checking what you wrote and explaining your mistakes. LLMs have tons of examples of that kind of thing in their training data.
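To make the two kinds of compiler feedback above concrete, here is a minimal Python sketch. The helper names (`check_syntax`, `mean_buggy`, `mean_fixed`) are made up for illustration and don't come from anyone in this thread. It only shows the distinction between "the compiler tells you there's a mistake" and "it runs but doesn't do what you wanted"; it isn't a claim about how LLMs are actually trained.

```python
# Hypothetical illustration: all names here are invented for this example.

def check_syntax(source: str) -> str:
    """Report whether Python's compiler accepts `source`, and why not if it doesn't."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as err:
        # Feedback type 1: the compiler says there is a mistake and roughly where.
        return f"rejected: {err.msg} (line {err.lineno})"
    return "accepted"


def mean_buggy(values):
    # Compiles and runs, but divides by the wrong count.
    return sum(values) / (len(values) - 1)


def mean_fixed(values):
    return sum(values) / len(values)


# Case 1: malformed code is rejected before it ever runs.
print(check_syntax("def f(:\n    pass"))

# Case 2: the compiler is silent about a logic error...
print(check_syntax("def mean(v):\n    return sum(v) / (len(v) - 1)"))

# ...so the mistake only "shows" when you run it and compare with what you wanted.
print(mean_buggy([2, 4, 6]), "vs", mean_fixed([2, 4, 6]))
```

In the second case, a knowledgeable reviewer reading the code could point out the wrong denominator without running anything, which is the kind of feedback the comment argues can substitute for a compiler.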
I’m not sure. I remember playing a bunch of games, like Pokemon HeartGold, Lego Star Wars, and some other Pokemon game where you were controlling little Pokemon in 3rd person instead of controlling a human who threw pokeballs (anyone know that game?)
And like, I didn’t speak English when I played them. So I had to figure out everything by just pressing random buttons and seeing responses. And this makes it a lot more difficult. Like I could open my “inventory” (didn’t know what that was) and then use a “healing potion” (didn’t know what that was), and then because my Pokemon was at full health already, I would think the healing potion was useless, or think that items in inventory only cause text to appear on the screen, but that they don’t have any effect on the actual game. And then I’d believe this until I accidentally clicked the inventory and randomly saw a change, or had failed a level so many times that I was getting desperate and just manually doing exhaustive search over all the actions.
But like, I’m very confident I was more action efficient than Claude is. Mostly because like, if I enter a battle, and like fail 5 times more or less in the same way, I start to think something is awry, and start doing different stuff. And also just because certain things become automatic after a short while, like moving around. For Claude it takes the same amount of time each time. So if you’re failing at a specific point in a battle, the fact that that point is responsible for you overall failing to progress becomes very obvious, because anything other than that becomes automatic and trivial and you just do it instantly.
Possibly amusing anecdote: when I was maybe ~6, my dad went on a business trip and very kindly brought home the new Pokémon Silver for me. Only complication was, his trip had been to Japan, and the game was in Japanese (it wasn’t yet released in the US market), and somehow he hadn’t realized this.
I managed to play it reasonably well for a while based on my knowledge of other Pokémon games. But eventually I ran into a person blocking a bridge, who (I presumed) was saying something about what I needed to do before I could advance. But, I didn’t understand what they were saying because it was in Japanese.
I had planned to seek out someone who spoke Japanese, and ask their help translating for me, but unfortunately there was almost nobody in my town who did. And so instead I resolved to learn Japanese—and that’s the story of what led to me becoming fluent at a young age.
(Just kidding! After flailing around a bit with possible bypasses, I gave up on playing the game until I got the US version.)
some other pokemon game where you were controlling little pokemon in 3rd person instead of controlling a human who threw pokeballs (anyone know that game?)
It’s definitely possible to get confused playing Pokémon Red, but as a human, you’re much better at getting unstuck. You try new things, have more consistent strategies, and learn better from mistakes. If you tried as long and as consistently as Claude does, even as a 6-year-old, you’d do much better.
I played Pokémon Red as a kid too (still have the cartridge!), it wasn’t easy, but I beat it in something like that 26 hour number IIRC. You have a point that howlongtobeat is biased towards gamers, but it’s the most objective number I can find, and it feels reasonable to me.
I’m not sure! Or well, I agree that 7-year-old me could get unstuck by virtue of having an “additional tool” called “get frustrated and cry until my mom took pity and helped.”[1] But we specifically prevent Claude from doing stuff like that!
I think it’s plausible that if we took an actual 6-year-old and asked them to play Pokemon on a Twitch stream, we’d see many of the things you highlight as weaknesses of Claude: getting stuck against trivial obstacles, forgetting what they were doing, and—yes—complaining that the game is surely broken.
TBC this is exaggerated for effect—I don’t remember actually doing this for Pokemon. And—to your point—I probably did eventually figure out on my own most of the things I remember getting stuck on.
Pokemon is a game literally made to be played and beaten by children. Six years old might be pushing the lower bound, but it didn’t become one of the largest gaming and entertainment franchises in the world by being too difficult to play for children, whom the game is designed for.
Yes, kids get stuck and they do use extra resources like searching up info on game guides (old man moment: before the internet, you had to find a friend who had the physical version and would let you borrow or look at it). But is the ability to search the internet the bottleneck that prevents Claude from getting past Mt. Moon in under 50 hours? That does not seem likely. In fact, giving it access to the internet, where it can get even more lost in additional useless or irrelevant information, could make the problem worse.
Yeah, I think that probably if the claim had been “worse than a 9 year old” then I wouldn’t have had much to complain about. I somewhat regret phrasing my original comment as a refutation of the “worse than a 6 year old” and “26 hour” claims, when really I was just using those as a jumping-off point to say some interesting-to-me stuff about how humans also get stuck on trivial obstacles in the same ways that AIs do.
I do feel like it’s a bit cleaner to factor apart Claude’s weaknesses into “memory,” “vision,” and “executive function” rather than bundling those issues together in the way the OP does at times. (Though obviously these are related, especially memory and executive function.) Then I would guess that Claude’s executive function actually isn’t that bad and might even be ≥human level. But it’s hard to say because the memory—especially visual memory—really does seem worse than a 6 year old’s.
I think that probably internet access would help substantially.
It would be so awesome to have such a stream as an additional reference point: just one six-year-old without internet or external help doing a Pokemon run.
Fair. But then also restrict it to someone who has no hands, eyes, etc.
By this I was mainly arguing against claims that this performance is “worse than a human 6-year-old”.
Probably Pokemon Mystery Dungeon.