That is, I suspect humans could be trained to perform very well, in the usual sense of “training” for humans where not too much data/time is necessary.
I paid people to try to get good at this game, and also various smart people like Paul Christiano tried it for a few hours, and everyone was still notably worse than GPT-2-sm (about the size of GPT-1).
EDIT: These results are now posted here.
I expect I would improve significantly with additional practice (e.g. I think a 2nd hour of playing the probability-assignment game would, in expectation, get a much higher score than my 1st). My subjective feeling was that I could probably learn to do as well as GPT-2-small (though that estimate is super noisy), but there’s definitely no way I was going to get close to GPT-2.
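(For readers unfamiliar with the game, here is a minimal sketch of how this kind of probability-assignment game can be scored, assuming the player reports a probability for the true next token and is scored with log loss. The scoring rule and the numbers are my illustration, not a description of the exact interface used above.)

```python
import math

def log_loss_score(prob_assigned_to_true_token: float) -> float:
    """Per-token score under log loss (in nats); lower is better."""
    # Clamp away from 0 so a hard zero doesn't give an infinite penalty.
    p = max(prob_assigned_to_true_token, 1e-6)
    return -math.log(p)

# Hypothetical transcript: the probability the player gave to the token
# that actually came next, at each step.
human_reports = [0.30, 0.05, 0.60, 0.10]
human_loss = sum(log_loss_score(p) for p in human_reports) / len(human_reports)

# A model's average per-token loss on the same text is computed the same way
# from its softmax probabilities, so the two numbers are directly comparable.
print(f"average loss: {human_loss:.3f} nats/token")
```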
I’m wary of the assumption that we can judge “human ability” on a novel task X by observing performance after an hour of practice.
There are some tasks where performance improves with practice but plateaus within one hour. I’m thinking of relatively easy video games. Or relatively easy games in general, like casual card/board/party games with simple rules and optimal policies. But most interesting things that humans “can do” take much longer to learn than this.
Here are some things that humans “can do,” but require >> 1 hour of practice to “do,” while still requiring far less exposure to task-specific example data than we’re used to in ML:
Superforecasting
Reporting calibrated numeric credences, a prerequisite for both superforecasting and the GPT game (does this take >> 1 hour? I would guess so, but I’m not sure; a short sketch of what “calibrated” means here follows below)
Playing video/board/card games of nontrivial difficulty or depth
Speaking any given language, even when learned during the critical language acquisition period
Driving motor vehicles like cars (arguably) and planes (definitely)
Writing good prose, for any conventional sense of “good” in any genre/style
Juggling
Computer programming (with any proficiency, and certainly e.g. competitive programming)
Doing homework-style problems in math or physics
Acquiring and applying significant factual knowledge in academic subjects like law or history
The last 3 examples are the same ones Owain_Evans mentioned in another thread, as examples of things LMs can do “pretty well on.”
If we only let the humans practice for an hour, we’ll conclude that humans “cannot do” these tasks at the level of current LMs either, which seems clearly wrong (that is, inconsistent with the common-sense reading of terms like “human performance”).
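As a concrete illustration of “calibrated numeric credences” in the list above (my sketch, not something from the thread): bin a forecaster’s reported probabilities and compare each bin’s mean reported probability to the empirical frequency of the forecasted events; for a calibrated forecaster the two numbers roughly match.

```python
from collections import defaultdict

def calibration_table(predictions, outcomes, n_bins=5):
    """Bin (reported probability, 0/1 outcome) pairs and compare each bin's
    mean reported probability to the observed frequency of the event.
    A well-calibrated forecaster has these two numbers roughly equal."""
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        rows.append((mean_p, freq, len(pairs)))
    return rows

# Hypothetical forecasts and outcomes.
preds = [0.9, 0.8, 0.2, 0.7, 0.1, 0.95, 0.3, 0.6]
outs  = [1,   1,   0,   1,   0,   1,    1,   0]
for mean_p, freq, n in calibration_table(preds, outs):
    print(f"reported ~{mean_p:.2f} -> happened {freq:.2f} of the time (n={n})")
```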
Ok, sounds like you’re using “not too much data/time” in a different sense than I was thinking of; I suspect we don’t disagree. My current guess is that some humans could beat GPT-1 with ten hours of practice, but that GPT-2 or larger would be extremely difficult, and plausibly impossible, with any amount of practice.
The human brain is internally performing computations very similar to those of transformer LLMs (as expected from prior research indicating strong similarity between DL vision features and primate vision), but that doesn’t mean we can readily extract those internal outputs and apply them to game performance.
It could be useful to look at GPT-3’s performance on foreign languages. We know roughly how long it takes humans to reach a given level in a foreign language. E.g. you might find that GPT-3’s level in each of 15 different languages would take a smart human (say) 2 months per language, or 30 months in total, to achieve. Foreign languages are just a small fraction of the training data.
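A sketch of how that per-language measurement could be run in practice: compute the model’s average per-token loss on held-out text in each language. GPT-3’s weights aren’t public, so `gpt2` via the Hugging Face transformers library stands in here, and the sample sentences are made up; a real evaluation would use sizable held-out corpora matched for content.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in: GPT-3's weights aren't public, but the measurement
# is the same for any causal LM exposed through this interface.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_loss_nats_per_token(text: str) -> float:
    """Average cross-entropy the model assigns to `text`, in nats per token."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# Made-up sample sentences; a real evaluation would use long held-out corpora
# matched for content across languages.
samples = {
    "English": "The weather today is cold and windy.",
    "German": "Das Wetter heute ist kalt und windig.",
    "French": "Le temps aujourd'hui est froid et venteux.",
}
for lang, text in samples.items():
    print(f"{lang}: {avg_loss_nats_per_token(text):.2f} nats/token")
```

One caveat: per-token losses aren’t directly comparable across languages under a single tokenizer, since it splits different languages into pieces of different sizes, so a real comparison would normalize to nats per character or per byte.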
I think I remember seeing somewhere that LLMs learn more slowly on languages with “more complex” grammar (in the sense of their loss decreasing more slowly for the same number of tokens), but I can’t find the source right now.