Nevertheless, it works. That’s how self-supervised training/pretraining works.
Right, I’m just saying that I don’t see how to map that metric to things we care about in the context of AI safety. If a language model outperforms humans at predicting the next word, maybe that’s just because it’s sufficiently better at modeling low-level stuff (e.g. GPT-3 may be better than me at predicting that you’ll write “That’s” rather than “That is”).
(As an aside, in the linked footnote I couldn’t easily spot any paper that actually evaluated humans on predicting the next word.)
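To make the “low-level stuff” worry concrete, here is a minimal sketch (my illustration, not something from the thread) of how one could compare the log-probability a causal language model assigns to two near-synonymous phrasings. It uses the Hugging Face transformers library with GPT-2 as a freely available stand-in for GPT-3, and the example sentences are made up:

```python
# Sketch: how much log-probability rides on a "low-level" choice like
# "That's" vs "That is"? Score both phrasings under a causal LM.
# GPT-2 is used here only as a stand-in; GPT-3 is not locally runnable.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def total_logprob(text: str) -> float:
    """Sum of log P(token_i | tokens_<i) over the sequence (first token excluded)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return logprobs.gather(2, targets.unsqueeze(-1)).sum().item()

# Two sentences that differ only in the contraction:
print(total_logprob("That's how self-supervised training works."))
print(total_logprob("That is how self-supervised training works."))
```

The gap between the two scores is exactly the kind of signal that a next-word-prediction metric rewards, whether or not it tracks anything we care about for safety.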
Third paragraph:
https://www.gwern.net/docs/ai/2017-shen.pdf
GPT-2 was benchmarked at 43 perplexity on the 1 Billion Word (1BW) benchmark vs a (highly extrapolated) human perplexity of 12.
The LAMBADA dataset was also constructed using humans to predict the missing words, but GPT-3 falls far short of perfection there, so while I can’t numerically answer it (unless you trust OA’s reasoning there), it is still very clear that GPT-3 does not match or surpass humans at text prediction.
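For readers less used to the metric, a quick gloss (mine, not part of either comment): perplexity is the exponentiated average negative log-likelihood per token,

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$

so a perplexity of 43 means the model is, on average, roughly as uncertain about the next token as a uniform choice among 43 options, 12 means about 12 options, and a perplexity of 1 would mean predicting every token with certainty.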
I wouldn’t say that that paper shows a (highly extrapolated) human perplexity of 12. It compares human-written sentences to language-model-generated sentences on the degree to which they seem “clearly human” vs “clearly unhuman”, as judged by human annotators. Amusingly, for every 8 human-written sentences that were judged “clearly human”, one human-written sentence was judged “clearly unhuman”. And that 8:1 ratio is the thing from which the human perplexity is derived. This doesn’t make sense to me.
If the human annotators in this paper had never annotated human-written sentences as “clearly unhuman”, this extrapolation would have yielded a human perplexity of 1! (As if humans could magically predict, word for word, an entire page of text sampled from the internet.)
If the comparison here is on the final LAMBADA dataset, after examples were filtered out based on disagreement between humans (as you mentioned in the newsletter), then it’s an unfair comparison: the examples were selected for being easy for humans.
BTW, I think the comparison to humans on the LAMBADA dataset is indeed interesting in the context of AI safety (more so than “predict the next word in a random internet text”), because I don’t expect the perplexity/accuracy to depend much on the ability to model very low-level stuff (e.g. “that’s” vs “that is”).
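For concreteness, here is a minimal sketch (mine, not from the thread, and a simplification of how OpenAI actually scored GPT-3) of a LAMBADA-style last-word evaluation with an open model via the Hugging Face transformers library. GPT-2 stands in for GPT-3, and the example passage is made up rather than taken from LAMBADA:

```python
# Sketch: LAMBADA-style evaluation. Given a passage with its final word removed,
# check whether a causal LM greedily generates that word. GPT-2 stands in for GPT-3.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def predicts_last_word(passage: str) -> bool:
    """Greedily generate a short continuation of the passage minus its last word,
    and check whether the first generated word matches the held-out word."""
    context, target = passage.rsplit(" ", 1)
    ids = tokenizer(context, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=5, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    continuation = tokenizer.decode(out[0, ids.shape[1]:])
    predicted = continuation.strip().split(" ")[0].strip(".,!?\"'")
    return predicted == target.strip(".,!?\"'")

# Made-up passage in the spirit of LAMBADA: the final word should be guessable
# from the broader context, not from the last few words alone.
passage = ("Anna had spent all week rehearsing the duet with Maria, so when the "
           "teacher asked who she wanted as a partner, she immediately said Maria")
print(predicts_last_word(passage))
```

Accuracy over a set of such passages is the kind of number the GPT-3 LAMBADA results refer to, which is why the filtering matters: if the surviving examples are only those on which human guessers agreed, the human baseline is being measured on a distribution selected to be easy for humans.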