The correct answer is the annoyingly trivial one: “it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText.”
How good is that, though? Well, it depends entirely on how good you think transformer LMs are capable of being, in principle.
If you’re Gary Marcus and you think transformer LMs will always suck in some ways, then you think the 1.69 model will also suck in those ways. Whereas, if you think a perfect transformer LM would be an AGI (even if only trained on MassiveText-like data), then you think the 1.69 model would be an AGI. Both of these people are right, conditional on their other beliefs.
The key distinction here is that “1.69 loss” may not be the best achievable loss on this dataset. It’s just an estimate of the best loss achievable by this kind of model.
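For concreteness, the 1.69 figure is the fitted “irreducible” term in the Chinchilla-style parametric loss (the constants below are the paper’s reported fit as I recall them, so treat them as approximate):

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28$$

In other words, 1.69 is just the asymptote of this fitted curve as $N$ (parameters) and $D$ (tokens) go to infinity, i.e. an extrapolation of how well this family of models does, not a measured floor for language modeling in general.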
The question “what would a model be like, if it got the best achievable loss, period?” is more interesting, but nothing in this post or these papers really touches on it.
it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText
Transformers are Turing complete, so “model of this type” is not much of a constraint. On the other hand, I guess it’s theoretically possible that some weight matrices are inaccessible to current training algorithms no matter how much compute and data we have. It also seems possible that the scaling law doesn’t go on forever, but phase-transitions somewhere (maybe very far out) to a new trend which goes below the “irreducible” term.