Some quick thoughts/comments:
--It can predict random internet text better than the best humans
I wouldn’t use this metric. I don’t see how to map between it and anything we care about. If it’s defined in terms of accuracy when predicting the next word, I wouldn’t be surprised if existing language models already outperform humans.
Also, I find the term “human-level AGI” confusing. Does it exclude systems that are super-human on some dimensions? If so, it seems too narrow to be useful. For the purpose of this post, I propose the following definition: a system that is able to generate text in a way that allows it to automatically perform any task that humans can perform by writing text.
Nevertheless, it works. That’s how self-supervised training/pretraining works.
They don’t. GPT-3 is still, as far as I can tell, about twice as bad in an absolute sense as humans in text prediction: https://www.gwern.net/Scaling-hypothesis#fn18
Right, I’m just saying that I don’t see how to map that metric to things we care about in the context of AI safety. If a language model outperforms humans at predicting the next word, maybe that’s just because it’s sufficiently better at modeling low-level stuff (e.g. GPT-3 may be better than me at predicting whether you’ll write “That’s” rather than “That is”).
(As an aside, in the linked footnote I couldn’t easily spot any paper that actually evaluated humans on predicting the next word.)
Third paragraph: https://www.gwern.net/docs/ai/2017-shen.pdf
The LAMBADA dataset was also constructed using humans to predict the missing words, but GPT-3 falls far short of perfection there, so while I can’t numerically answer it (unless you trust OA’s reasoning there), it is still very clear that GPT-3 does not match or surpass humans at text prediction.
--GPT-2 was benchmarked at 43 perplexity on the 1 Billion Word (1BW) benchmark vs a (highly extrapolated) human perplexity of 12
I wouldn’t say that that paper shows a (highly extrapolated) human perplexity of 12. It compares human-written sentences to language-model-generated sentences on the degree to which they seem “clearly human” vs “clearly unhuman”, as judged by humans. Amusingly, for every 8 human-written sentences that were judged “clearly human”, one human-written sentence was judged “clearly unhuman”. And that 8:1 ratio is the thing from which the human perplexity is derived. This doesn’t make sense to me.
If the human annotators in this paper had never annotated human-written sentences as “clearly unhuman”, this extrapolation would have shown human perplexity of 1! (As if humans can magically predict an entire page of text sampled from the internet.)
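To spell out the arithmetic behind that last point, here is a toy sketch of what a given perplexity value means in terms of per-word probabilities (this is my own illustration of the metric, not the paper’s extrapolation procedure): perplexity is the exponentiated average negative log-probability per word, so a perplexity of 1 is only reachable by assigning probability 1 to every single next word.

```python
import math

def perplexity(probs):
    """Perplexity = exp of the average negative log-probability per word."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

print(perplexity([1.0] * 100))     # 1.0  -- every word predicted with certainty
print(perplexity([1 / 12] * 100))  # 12.0 -- the extrapolated human figure, if each word got probability 1/12
print(perplexity([1 / 43] * 100))  # 43.0 -- GPT-2's reported 1BW figure, if each word got probability 1/43
```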
--The LAMBADA dataset was also constructed using humans to predict the missing words, but GPT-3 falls far short of perfection there
If the comparison here is on the final LAMBADA dataset, after examples were filtered out based on disagreement between humans (as you mentioned in the newsletter), then it’s an unfair comparison. The examples are selected for being easy for humans.
BTW, I think the comparison to humans on the LAMBADA dataset is indeed interesting in the context of AI safety (more so than “predict the next word in random internet text”), because I don’t expect the perplexity/accuracy there to depend much on the ability to model very low-level stuff (e.g. “that’s” vs “that is”).
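To make concrete what that comparison measures, here is a rough sketch of final-word prediction with GPT-2 via Hugging Face transformers (my own illustration: the passage is made up rather than taken from LAMBADA, and a real evaluation would also handle final words that span multiple tokens):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Made-up passage in the LAMBADA style: guessing the final word is meant to
# require the broader context, not just the last few words.
context = ("George had not touched the violin since his grandfather died. "
           "Tonight, for the first time in years, he lifted it to his chin and began to")
target = "play"  # the held-out final word

inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # distribution over the next token

predicted = tokenizer.decode([int(torch.argmax(next_token_logits))]).strip()
print(f"model predicts {predicted!r}, target is {target!r}, match: {predicted == target}")
```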
OK, fair enough.
Yeah, human-level is supposed to mean not strongly superhuman at anything important, while also not being strongly subhuman at anything important.
I think that’s roughly the concept Nick Bostrom used in Superintelligence when discussing takeoff dynamics. (The usage of that concept is my only major disagreement with that book.) IMO it would be very surprising if the first ML system that is not strongly subhuman at anything important turned out not to be strongly superhuman at anything important (assuming this property is not optimized for).
Yeah, I think I agree with that. Nice.
The most capable humans are often much more capable than the average human, and yet still not superhuman. I remember the example of a hacker who gave a talk at the CCC about how he was on vacation in Taiwan and hacked their electronic payment system on the side. If you could scale him up 10,000 or 100,000 times, the kind of cyberwar you could wage would be enormous.