This graph reports the relative performance of models of different sizes. We know the sizes of Nano 1 and Nano 2, so, given how scaling laws work, this is a massive hint about the sizes of Pro and Ultra.
The weird thing is, I looked at this graph, I did exactly that (because of course), and I got insane results. Applying that approach, Pro has something like ~13B parameters, and Ultra something like ~26B parameters. That’s from just eyeballing the step sizes and doing mental arithmetic, but the results are crazy enough that I’m not going to bother breaking out a ruler and calculator to do a more detailed estimate. It does not take a significant proportion of all of Google’s latest TPUs months to train a ~26B-parameter model, and if Google can really get narrowly-beating-GPT-4-level performance out of a model only the size of a mid-sized LLaMA, or 3–4 times the size of Mistral, then I am deeply impressed by their mad skillz. Also, you don’t call something “Nano” if it’s 1⁄8 or 1⁄16 the size of your biggest model: the parameter-count ratio between a data-center model and a phone model has to be a lot larger than that. I don’t believe those sizes for a moment, so my conclusion is that this graph was released exactly because it’s effectively on some sort of distinctly screwy scale that makes it impossible to use it to meaningfully estimate Pro’s and Ultra’s parameter counts. I think it’s marketing shlock, probably aimed at Pixel phone buyers.
My first guess would be that Nano 1 and Nano 2 were not originally pretrained at those parameter counts, but have been heavily distilled down (or by some newer, whizzier equivalent of distillation) on certain fairly narrow skillsets, or even specific use cases, valuable for their on-Pixel-phone role (because they’d be dumb not to, and DeepMind + the Brain aren’t dumb), and that those tests, while their one-word descriptions sound general (well, apart from “summarization”), were carefully arranged to fall inside those fairly narrow skillsets. (It also wouldn’t surprise me if Nano 1 & 2 came in language-specific versions, each separately distilled, and that graph is from the EN versions. Though, since one of the test names was “multilingual”, those versions wouldn’t be entirely monolingual: they’d still have some basic translation ability to/from useful languages.) So I suspect the tests are designed to heavily favor the Nano models. Or the scoring is arranged in some way such that the scale is distinctly non-linear. Or indeed quite possibly some cunning combination of both. I’m fairly sure that if Jeff Dean (or one of his senior PMs) had sent me a sketch of what that graph needed to look like, say, a few months ago, and assigned me a team of researchers, we could have come up with a set of measurements that looked like that.
It also seems likely that the Nano models are extremely overtrained compared to what the scaling laws would suggest. The scaling laws are for compute-optimal training, but here they want to minimize inference cost, so it would make sense to train for significantly longer.
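(For a sense of scale, here’s a minimal back-of-envelope sketch of what “overtrained relative to compute-optimal” looks like, assuming the rough Chinchilla heuristic of ~20 training tokens per parameter; the 10× overtraining factor is purely an illustrative guess, not anything Google has disclosed.)

```python
# Back-of-envelope sketch of "overtrained relative to compute-optimal",
# assuming the rough Chinchilla heuristic of ~20 training tokens per parameter.
# The 10x overtraining factor below is an illustrative guess, not a known figure.

CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training-token count for a model with n_params parameters."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params

for name, n_params in [("Nano 1", 1.8e9), ("Nano 2", 3.25e9)]:
    optimal = compute_optimal_tokens(n_params)
    overtrained = 10 * optimal  # hypothetical: train on 10x the compute-optimal token budget
    print(f"{name}: compute-optimal ≈ {optimal / 1e9:.0f}B tokens, "
          f"10x overtrained ≈ {overtrained / 1e12:.2f}T tokens")
```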
Agreed (well, except for the nitpick that post-Chinchilla versions of the scaling laws also make predictions for scaling data and parameter count separately, including in the overtraining regime): overtraining during distillation seems like the obvious approach, using a lot of data (possibly much of it synthetic, which would let you avoid issues like memorization of PII and copyrighted text) rather than many epochs, in order to minimize memorization. Using distillation also effectively increases the size of your distillation training set for scaling-law purposes, since the trainee model now gets more data per example: not just the tokens in the correct answer, but their logits and those of all the top alternative tokens according to the larger trainer model. So each document in the distillation training set becomes worth several times as much.
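To make the “more data per example” point concrete, here’s a minimal sketch of a standard Hinton-style logit-distillation loss. I’m not claiming this is what Google actually did for the Nano models, and in practice you might only keep the teacher’s top-k logits per position rather than the full distribution, but it shows how each token position contributes a whole soft distribution rather than a single hard label.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      target_ids: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Standard soft-label distillation loss (Hinton et al. 2015 style).

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    target_ids: (batch, seq_len) ground-truth next-token ids
    """
    # Soft targets: KL divergence between the teacher's and student's
    # temperature-softened next-token distributions, scaled by T^2.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard targets: ordinary cross-entropy against the actual next tokens,
    # i.e. what plain pretraining on the same document would give you.
    hard = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        target_ids.reshape(-1),
    )

    # Every position supplies both signals, which is why a distillation
    # training set is effectively "worth" more than the same documents
    # used for ordinary next-token training.
    return alpha * soft + (1.0 - alpha) * hard
```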
What assumptions is this making about scaling laws for these benchmarks? I wouldn’t know how to convert laws for losses into these kinds of fuzzy benchmarks.
I was simply looking at the average, across the bar charts, of the improvement step size between Nano 1 at 1.8B and Nano 2 at 3.25B parameters, a factor of ~2 in parameter count. The step size up to Pro is about twice that, and from Pro up to Ultra about the same as from Nano 1 to Nano 2, each again averaged across the bar charts. Assuming only that the score follows the scaling law as a straight line on a log-linear graph, i.e. equal score gains per doubling of parameter count, as these curves essentially always are (well, unless you use a screwy nonlinear scoring scheme), that would mean that Pro’s parameter count was around 2² = 4 times that of Nano 2, which is ~13B, and Ultra was another doubling again, at ~26B. Those numbers are obviously wrong, so this is not a simple log-linear chart of something that just scales smoothly with parameter count. And I’m sure they wouldn’t have published it if it were.
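For anyone who wants to check the arithmetic, here’s the same back-of-envelope extrapolation as a tiny Python sketch. The step-size multiples (2× and 1× the Nano-to-Nano gain) are my eyeballed readings of the bar charts, not measurements, and the only assumption is the log-linear scaling described above.

```python
import math

# Eyeball extrapolation assuming score is linear in log2(parameter count),
# i.e. equal score gains per doubling of model size. The step-size multiples
# are rough eyeballed readings of the bar charts, not measured values.

nano1, nano2 = 1.8e9, 3.25e9     # known parameter counts
step = math.log2(nano2 / nano1)  # ~0.85 doublings per "one Nano step" of score

pro_steps = 2.0    # eyeballed: Nano 2 -> Pro gain looks about 2x the Nano 1 -> Nano 2 gain
ultra_steps = 1.0  # eyeballed: Pro -> Ultra gain looks about 1x that gain

pro = nano2 * 2 ** (pro_steps * step)
ultra = pro * 2 ** (ultra_steps * step)

# Prints ~11B and ~19B; rounding the Nano step up to a full doubling gives
# the ~13B and ~26B figures quoted above. Either way, implausibly small.
print(f"Implied Pro   ≈ {pro / 1e9:.0f}B parameters")
print(f"Implied Ultra ≈ {ultra / 1e9:.0f}B parameters")
```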