Confusion:

You write “Only PaLM looks better than Chinchilla here, mostly because it trained on 780B tokens instead of 300B or fewer, plus a small (!) boost from its larger size.”

But earlier you write:

“Chinchilla is a model with the same training compute cost as Gopher, allocated more evenly between the two terms in the equation.

It’s 70B params, trained on 1.4T tokens of data”

300B vs. 1.4T. Is this an error?

Hmm, yeah, I phrased that point really badly. I’ll go back and rewrite it. A clearer version of the sentence might read:

“Only PaLM is remotely close to Chinchilla here, mostly because it trained on a larger number of tokens than the other non-Chinchilla models, plus a small (!) boost from its larger size.”

For instance, if you look at the loss improvement from Gopher to PaLM, 85% of it comes from the increase in data alone, and only 15% from the increase in model size. This is what I meant when I said that PaLM only got a “small” boost from its larger size.

EDIT: rewrote and expanded this part of the post.
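The 85%/15% split above can be sanity-checked with a short sketch, assuming the Chinchilla parametric loss L(N, D) = E + A/N^α + B/D^β with the fitted constants from Hoffmann et al. (2022), and the commonly cited sizes of 280B params / 300B tokens for Gopher and 540B params / 780B tokens for PaLM:

```python
# Sketch: decompose the Gopher -> PaLM loss improvement under the
# Chinchilla parametric loss L(N, D) = E + A/N**alpha + B/D**beta.
# Constants are the Hoffmann et al. (2022) fits; the model sizes and
# token counts below are the commonly cited figures, not exact specs.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

gopher_params, gopher_tokens = 280e9, 300e9
palm_params, palm_tokens = 540e9, 780e9

# Loss reduction attributable to the parameter-count term alone,
# and to the data term alone, between the two models.
d_params = A / gopher_params**alpha - A / palm_params**alpha
d_data = B / gopher_tokens**beta - B / palm_tokens**beta
total = d_params + d_data

print(f"from data:   {d_data / total:.0%}")    # -> 85%
print(f"from params: {d_params / total:.0%}")  # -> 15%
```

Because the two terms of the loss are additive, the improvement decomposes cleanly, and the data term dominates even though PaLM is nearly twice Gopher's size.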
I think that in that first sentence, OP is comparing PaLM to other large LMs rather than to Chinchilla.