So, am I right in thinking that if someone took random internet text and fed it to me word by word and asked me to predict the next word, I’d do about as well as GPT-2 and significantly worse than GPT-3?
Just use bleeding edge tech to analyze ancient knowledge from the god of information theory himself.
This paper seems to be a good summary; it puts a lower bound on the entropy of human models of English somewhere between 0.65 and 1.10 BPC. If I had to guess, the real number is probably closer to 0.8–1.0 BPC, since the paper's method was able to pull up the lower bound for Hebrew by about 0.2 BPC. Assuming that regular English averages 4* characters per token, GPT-3 clocks in at 1.73/ln(2)/4 ≈ 0.62 BPC. This is lower than the lower bound mentioned in the paper.
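To make that arithmetic explicit (assuming GPT-3's reported loss of 1.73 nats per token and the ~4 characters-per-token figure, which is my own estimate, not an official number):

```python
import math

loss_nats_per_token = 1.73  # GPT-3's reported validation loss, in nats per token
chars_per_token = 4         # rough average for plain English text (my own estimate)

# Convert nats/token -> bits/token, then bits/token -> bits/character (BPC).
bits_per_token = loss_nats_per_token / math.log(2)
bpc = bits_per_token / chars_per_token
print(f"{bpc:.2f} BPC")  # prints 0.62 BPC
```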
That would also be my guess. In terms of data entropy, I think GPT-3 is probably already well into the superhuman realm.
I suspect this is mainly because GPT-3 is much better at modelling “high frequency” patterns and features in text that account for a lot of the entropy, but that humans ignore because they have low mutual information with the things humans care about. OTOH, GPT-3 also has extensive knowledge of pretty much everything, so it might be leveraging that and other things to make better predictions than you.
This is similar to what we see with autoregressive image and audio models, where high frequency features are fairly well modelled, but you need a really strong model to also get the low frequency stuff right.
*(ask Gwern for details, this is the number I got in my own experiments with the tokenizer)
Hmmm, your answer contradicts Gwern’s answer. I had no idea my question would be so controversial! I’m glad I asked, and I hope the controversy resolves itself eventually...
From that paper:
> A new improved method for evaluation of both lower and upper bounds of the entropy of printed texts is developed.
“Printed texts” probably falls a standard deviation or three above the median human’s performance. It’s subject to some fairly severe sampling bias.
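For intuition on where entropy bounds like these come from: any explicit model of text gives an upper bound on the source's entropy rate, and even a crude i.i.d. unigram character model already lands in a sensible ballpark (~4 BPC for English letters, versus Shannon's ~1 BPC estimate from human next-letter prediction). A minimal sketch:

```python
from collections import Counter
import math

def unigram_entropy_bpc(text: str) -> float:
    """Bits per character under an i.i.d. unigram character model --
    a loose upper bound on the true entropy rate of the source."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
print(round(unigram_entropy_bpc(sample), 2))
```

Better models (character n-grams, word models, human guessers, GPT-3) squeeze this upper bound down toward the true entropy rate, which is what makes GPT-3's ~0.62 BPC figure notable.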