Nevertheless, it works. That’s how self-supervised training/pretraining works.
They don’t. GPT-3 is still, as far as I can tell, about twice as bad in an absolute sense as humans in text prediction: https://www.gwern.net/Scaling-hypothesis#fn18
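For concreteness, the metric behind that comparison is perplexity: the exponentiated average negative log-probability the model assigns to each word that actually comes next, so a model with perplexity P is, on average, as uncertain as if it were picking uniformly among P equally likely words. A minimal sketch of the arithmetic (the probabilities below are made up purely for illustration):

```python
import math

# Perplexity = exp(average negative log-probability assigned to the word
# that actually came next). These probabilities are invented for illustration.
p_true_next_word = [0.05, 0.20, 0.01, 0.30, 0.08]

avg_nll = -sum(math.log(p) for p in p_true_next_word) / len(p_true_next_word)
perplexity = math.exp(avg_nll)
print(round(perplexity, 1))  # ~13.3: as uncertain as a uniform ~13-way guess per word
```

Lower is better, and since perplexity acts like an effective number of guesses per word, a gap such as 43 vs. 12 (discussed below) means needing roughly three to four times as many equally likely guesses per word.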
Right, I’m just saying that I don’t see how to map that metric to things we care about in the context of AI safety. If a language model outperforms humans at predicting the next word, maybe that’s just because it’s sufficiently better at modeling low-level stuff (e.g. GPT-3 may be better than me at predicting that you’ll write “That’s” rather than “That is”).
(As an aside, in the linked footnote I couldn’t easily spot any paper that actually evaluated humans on predicting the next word.)
Third paragraph: https://www.gwern.net/docs/ai/2017-shen.pdf
The LAMBADA dataset was also constructed using humans to predict the missing words, but GPT-3 falls far short of perfection there, so while I can’t numerically answer it (unless you trust OA’s reasoning there), it is still very clear that GPT-3 does not match or surpass humans at text prediction.
GPT-2 was benchmarked at 43 perplexity on the 1 Billion Word (1BW) benchmark vs a (highly extrapolated) human perplexity of 12
I wouldn’t say that that paper shows a (highly extrapolated) human perplexity of 12. It compares human-written sentences to language-model-generated sentences on the degree to which they seem “clearly human” vs “clearly unhuman”, as judged by human annotators. Amusingly, for every 8 human-written sentences judged “clearly human”, one human-written sentence was judged “clearly unhuman”. And that 8:1 ratio is the thing from which the human perplexity is derived. This doesn’t make sense to me.
If the human annotators in this paper had never labeled human-written sentences as “clearly unhuman”, the extrapolation would have shown a human perplexity of 1! (As if humans can magically predict an entire page of text sampled from the internet.)
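To make that concrete, here is a toy version of the extrapolation. The curve is entirely made up (it is not the procedure the paper actually uses, and the names below are mine); the only point it illustrates is that the rate at which annotators call human-written text “clearly unhuman” is the sole input, so a rate of zero gets read back as perfect prediction:

```python
import math

# Toy stand-in for the extrapolation (NOT the paper's actual method): assume the
# rate at which judges call a source's sentences "clearly unhuman" is proportional
# to the log-perplexity of that source, so a rate of 0 maps to perplexity 1
# (perfect prediction). Calibrate the curve to the two numbers in the thread.
observed_rate = 1 / 9        # 1 "clearly unhuman" per 8 "clearly human" judgments
claimed_human_ppl = 12.0     # the extrapolated figure being disputed

c = observed_rate / math.log(claimed_human_ppl)

def implied_perplexity(rate: float) -> float:
    """Invert the toy curve: the perplexity a given 'clearly unhuman' rate implies."""
    return math.exp(rate / c)

print(implied_perplexity(1 / 9))  # ~12.0, by construction
print(implied_perplexity(0.0))    # 1.0 -- zero mislabels would be read as humans
                                  # predicting arbitrary internet text perfectly
```

Whatever the real curve looks like, the implied figure is driven entirely by how often annotators mislabel human text, not by any direct measurement of humans predicting next words.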
If the comparison here is on the final LAMBADA dataset, after examples were filtered out based on disagreement between humans (as you mentioned in the newsletter), then it’s an unfair comparison. The examples are selected for being easy for humans.
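Here is a minimal simulation of that selection effect, with made-up numbers (this is not the actual LAMBADA construction procedure, just an illustration of the filtering bias):

```python
import random
random.seed(0)

# Made-up setup: each candidate passage has a difficulty d in [0, 1], and an
# annotator guesses the missing word correctly with probability 1 - d.
def annotator_correct(d):
    return random.random() < 1 - d

candidates = [random.random() for _ in range(100_000)]

# Keep a passage only if two independent annotators both get it right --
# a crude stand-in for filtering out examples with human disagreement.
kept = [d for d in candidates if annotator_correct(d) and annotator_correct(d)]

# A fresh annotator's accuracy on the raw pool vs. on the filtered dataset:
raw_acc = sum(annotator_correct(d) for d in candidates) / len(candidates)
kept_acc = sum(annotator_correct(d) for d in kept) / len(kept)
print(f"unfiltered: {raw_acc:.2f}   filtered: {kept_acc:.2f}")  # ~0.50 vs ~0.75
```

The annotators are identical before and after filtering; the filtered number is higher only because hard passages were removed. That is the sense in which measuring a model against human performance on the final dataset stacks the deck.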
BTW, I think the comparison to humans on the LAMBADA dataset is indeed interesting in the context of AI safety (more so than “predict the next word in random internet text”), because I don’t expect the perplexity/accuracy there to depend much on the ability to model very low-level stuff (e.g. “that’s” vs “that is”).