I agree that the difference in datasets between 1BW and PTB makes precise comparisons impossible. Also, the “human perplexity = 12” figure on 1BW is not measured directly. It is extrapolated from the authors' constructed “human judgement score” metric, using the values of both that score and perplexity for pre-2017 language models, and the authors themselves note that the extrapolation is unreliable.