Next up would be some notion of abstraction or coarse-graining. I quickly make mistakes when predicting the probability distribution over next letters, yet somehow I can make long-term predictions about abstract properties of the text. Yes, these abilities get trained purely by next-word prediction, but I think they’re still worth measuring separately: if we were better at measuring them, we’d see that they lag behind progress in local perplexity, because large-scale patterns put much heavier demands on transformers’ dataset size and model size (when looking densely at long stretches of text; yes, I know it’s linear, but the coefficient is large).
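To make this a bit more concrete, here’s a rough sketch (my own illustration, nothing rigorous) of what measuring the two things separately might look like: the usual local next-token perplexity alongside a crude coarse-grained probe that asks whether the model, given a long prefix, prefers a true abstract description of where the text is going over an unrelated one. It assumes a Hugging Face-style causal LM; the model name, prompt template, and summaries are placeholders.

```python
# Sketch only: local perplexity vs. a crude long-range "abstract property" probe.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def avg_nll(text: str) -> float:
    """Average next-token negative log-likelihood (nats per token)."""
    ids = tok(text, return_tensors="pt").input_ids
    out = model(ids, labels=ids)  # HF shifts labels internally
    return out.loss.item()

def local_perplexity(text: str) -> float:
    """exp of the average next-token NLL: the usual local metric."""
    return math.exp(avg_nll(text))

@torch.no_grad()
def prefers_true_summary(prefix: str, true_summary: str, foil_summary: str) -> bool:
    """
    Coarse-grained probe: given a long prefix, does the model assign lower NLL
    to an abstract description of the upcoming text than to an unrelated one?
    A stand-in for "long-term prediction of abstract properties".
    """
    def summary_nll(summary: str) -> float:
        full = prefix + "\n\nIn short, what follows is: " + summary
        ids = tok(full, return_tensors="pt").input_ids
        n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
        logits = model(ids).logits[:, :-1]          # predict token t+1 from position t
        targets = ids[:, 1:]
        nll = torch.nn.functional.cross_entropy(
            logits.transpose(1, 2), targets, reduction="none"
        )
        # Score only the tokens after the prefix (boundary is approximate
        # because tokenization can merge across the join).
        return nll[:, n_prefix - 1 :].mean().item()

    return summary_nll(true_summary) < summary_nll(foil_summary)
```

The point of splitting it this way is just that the second number can stagnate while the first keeps improving, which is exactly the gap I’d expect better measurement to reveal.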