Hmm, yeah, I phrased that point really badly. I’ll go back and rewrite it.
A clearer version of the sentence might read:
“Only PaLM is remotely close to Chinchilla here, mostly because it trained on a larger number of tokens than the other non-Chinchilla models, plus a small (!) boost from its larger size.”
For instance, if you use the fitted Chinchilla scaling law to decompose the loss improvement from Gopher to PaLM, about 85% of it comes from the increase in training data alone, and only ~15% from the increase in model size. This is what I meant when I said that PaLM only got a “small” boost from its larger size.
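For concreteness, here's a minimal sketch of where that 85/15 split comes from, assuming the Chinchilla paper's fitted form L(N, D) = E + A/N^α + B/D^β (Hoffmann et al. 2022: E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28) and the published scales of Gopher (280B params, 300B tokens) and PaLM (540B params, 780B tokens). Because the fitted law is additive in the N-term and the D-term, the two contributions sum exactly to the total predicted improvement, so the split is well-defined:

```python
# Sketch: decompose the predicted Gopher -> PaLM loss improvement under
# the Chinchilla fit L(N, D) = E + A / N**alpha + B / D**beta.
# Constants below are the fitted values from Hoffmann et al. (2022).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss at n_params parameters and n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Published scales: Gopher (280B params, 300B tokens), PaLM (540B params, 780B tokens).
gopher = (280e9, 300e9)
palm = (540e9, 780e9)

# Contribution of each factor, holding the other fixed at Gopher's value.
# These sum exactly to loss(*gopher) - loss(*palm) because the law is additive.
from_size = loss(*gopher) - loss(palm[0], gopher[1])
from_data = loss(*gopher) - loss(gopher[0], palm[1])

total = from_size + from_data
print(f"from data: {from_data / total:.0%}")  # ~85%
print(f"from size: {from_size / total:.0%}")  # ~15%
```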
EDIT: rewrote and expanded this part of the post.