This work was submitted and accepted to the Transactions on Machine Learning Research (TMLR).
During the rebuttal phase, I ran the additional analyses that reviewers suggested. I found that:
A big reason why humans suck at next-token prediction (NTP) in our experiments is that they suck at tokenization (see the tokenizer sketch below).
Our experiments were done on a dataset that has some overlap with LLM training sets, but this probably has only a small effect.
A potential reason why humans have terrible perplexity is that they aren't good at assigning fine-grained probabilities when comparing how likely different tokens are (see the perplexity sketch below).
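To make the tokenization point concrete, here is a small sketch (my own illustration, assuming the Hugging Face transformers package; the example sentence is made up) of how GPT-2's BPE tokenizer carves text into sub-word pieces. A human playing the next-token-prediction game has to guess these exact pieces, leading-space markers and all, not just the next word.

```python
# Rough sketch: inspect how GPT-2's BPE tokenizer splits a sentence into
# sub-word pieces. This uses the Hugging Face `transformers` package, which is
# my own choice of tool here, not something the paper prescribes.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

text = "Peer review caught an honest mistake in our tokenization setup."
ids = tok.encode(text)
pieces = tok.convert_ids_to_tokens(ids)

# Print each token id next to the exact string it stands for. Rarer words tend
# to be split into several pieces, and most pieces start with 'Ġ', GPT-2's
# marker for a leading space; a human has to get all of that exactly right.
for token_id, piece in zip(ids, pieces):
    print(f"{token_id:>6}  {piece!r}")
```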
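And a toy illustration of the perplexity point, again my own and not the paper's actual elicitation protocol: perplexity here is exp of the mean negative log-probability assigned to the true token, so rounding confident judgments down to a coarse scale gets punished at every position.

```python
# Toy illustration with made-up numbers: coarse probability judgments inflate
# perplexity, defined as exp(mean negative log-probability of the true token).
import math

def perplexity(probs_of_true_token):
    """probs_of_true_token: probability given to the actual next token at each position."""
    return math.exp(-sum(math.log(p) for p in probs_of_true_token) / len(probs_of_true_token))

# Hypothetical well-calibrated, fine-grained probabilities over six positions.
fine = [0.92, 0.60, 0.85, 0.30, 0.97, 0.75]

# The same judgments squashed to a coarse "pretty sure" / "unsure" scale,
# closer to the granularity a human might realistically report.
coarse = [0.7 if p > 0.5 else 0.3 for p in fine]

print(f"fine-grained perplexity: {perplexity(fine):.2f}")  # about 1.46 with these numbers
print(f"coarse perplexity:       {perplexity(coarse):.2f}")  # about 1.65 with these numbers
```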
Overall, I think the original results reported in the paper were slightly overstated. In particular, I retract the claim that GPT-2-small is not clearly worse than humans at next-token prediction. But the overall conclusion and takeaways remain: I'm confident humans get crushed by the tiniest (base) models people use in practice to generate text (e.g. StableLM-1.6B).
I think I underestimated how much peer review can help catch honest mistakes in experimental setups (though I probably shouldn't update too hard: the next-token-prediction project was a 3-week project, and I was a very inexperienced researcher at the time). Overall, I'm happy that peer review helped me fix something somewhat wrong that I had released on the internet.
The updated paper can be found on arXiv: https://arxiv.org/pdf/2212.11281