You calculated things for the neural network brain size anchor; now here’s the performance scaling trend calculation (I think):
I took these graphs from the Chinchilla paper and then made them transparent and superimposed them on one another and then made a copy on the right to extend the line. And I drew some other lines to extend them.
Eyeballing this graph it looks like whatever performance we could achieve with 10^27 FLOPs under the Kaplan scaling laws, we can now achieve with 10^25 FLOPs. (!!!) This is a big deal if true. Am I reasoning incorrectly here?
If this is anywhere close to correct, then the distinction you mention between two methods of getting timelines—“Assume it happens when we train a brain-sized model compute-optimally” vs. “assume it happens when we get to superhuman performance on this ensemble of benchmarks that we already have GPT trends for” becomes even more exciting and important than I thought! It’s like, a huge huge crux, because it basically makes for a 4 OOM difference!
EDIT: To be clear, if this is true then I think I should update away from the second method, on the grounds that it predicts we are only about 1 OOM away and that seems implausible.
First I gotta say: I thought I knew the art of doing quick-and-dirty calculations, but holy crap, this methodology is quick-and-dirty-ier than I would ever have thought of. I’m impressed.
But I don’t think it currently gets to the right answer. One salient thing: it doesn’t take into account Kaplan’s “contradiction”. I.e., Kaplan’s laws already suggested that once we were using enough FLOP, we would have to scale data faster than we have to do in the short term. So when I made my extrapolations, I used a data-exponent that was larger than the one that’s represented in that graph.
I then tried to figure out the answer to this question using Chinchilla’s loss curves and Kaplan’s adjusted-for-contradiction loss curves, but I realised...
...that Chinchilla’s “loss” and Kaplan’s “loss” are pretty incomparable.
It’s unsurprising that they’re somewhat different (they might have used different datasets or something, when evaluating the loss), but I am surprised that Chinchilla’s curves use an additive term that predicts that loss will never go below 1.69. What happened with the claims that ideal text-prediction performance was like 0.7? (E.g. see here for me asking why gwern estimates 0.7, and gwern responding.)
Anyway, this makes it very non-obvious to me how to directly translate my benchmark extrapolations to a chinchilla context. Given that their “loss” is so different, I don’t know what I could reasonably assume about the relationship between [benchmark performance as a function of chinchilla!loss] and [benchmark performance as a function of gpt-3!loss].
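To make the "never below 1.69" point concrete, here is a minimal sketch of the parametric loss fit, assuming the form L(N, D) = E + A/N^α + B/D^β with the constants reported in the Chinchilla paper. Only the 1.69 figure appears in this thread; the other constants, the function names, and the common C ≈ 6·N·D compute approximation are assumptions brought in for illustration.

```python
# Fitted constants as reported for the Chinchilla parametric loss
# (assumption: taken from the paper, not from this thread, apart from E = 1.69).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss in nats per token for N parameters and D training tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def approx_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute, using the common C ~= 6 * N * D approximation."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    # Scale parameters and tokens together; the loss creeps toward E
    # but the additive term means it never drops below it.
    for n, d in [(70e9, 1.4e12), (7e11, 1.4e13), (7e13, 1.4e15)]:
        print(f"C={approx_flops(n, d):.2e} FLOP  loss={chinchilla_loss(n, d):.3f}")
```

Both power-law terms shrink as N and D grow, so under this fit E acts as the irreducible floor. Whether that floor means the same thing as Kaplan's loss is a separate question, since the units and datasets differ.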
but I am surprised that Chinchilla’s curves uses an additive term that predicts that loss will never go below 1.69. What happened with the claims that ideal text-prediction performance was like 0.7?
Apples & oranges, you’re comparing different units. Comparing token perplexities is hard when the tokens (not to mention datasets) differ. Chinchilla isn’t a character-level model but uses BPEs (well, they say SentencePiece which is more or less BPEs), and BPEs didn’t even exist until the past decade so there will be no human estimates which are in BPE units (and I pity any subjects who are supposed to try to learn to predict the OA BPEs). If you want to handwave, BPEs are, I think, roughly equivalent to like 3 characters or bytes, so a bad upper bound on what ideal BPE loss would be is 0.7*3=2.1, which is consistent with Chinchilla & Gopher hitting <2 BPE loss.
They do include bits-per-byte losses which vary widely but are indeed much closer to 0.7 than 1.69: https://arxiv.org/pdf/2203.15556.pdf#page=30 But no scaling laws on those you can grab an intrinsic entropy/irreducible loss from. Maybe there’s some way to average over those bit-per-byte laws and translate the scaling law? The estimate would be pretty unstable, however: you can see how much the different corpuses vary, often by many times what the absolute remaining distance-to-true-human-loss must be.
NB: Loss ≠ perplexity. Perplexity is the exponential of the entropy, and you have to take a logarithm before comparing it to bits-per-thing. 1.69 is a loss, not a perplexity, which is already in nats (which are a constant factor different to bits). An example of perplexity is Chinchilla getting 7.16 (~e^1.97) on Wikitext103.
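The loss/perplexity/bits relationships above can be sketched in a few lines. The helper names are made up for illustration; the only inputs taken from the thread are the 1.69 loss and the 7.16 perplexity.

```python
import math


def perplexity_to_nats(perplexity: float) -> float:
    """Perplexity is exp(loss), so the loss in nats is its natural log."""
    return math.log(perplexity)


def nats_to_perplexity(nats: float) -> float:
    """Inverse of the above: exponentiate a loss in nats to get perplexity."""
    return math.exp(nats)


def nats_to_bits(nats: float) -> float:
    """Nats and bits differ by a constant factor of ln(2)."""
    return nats / math.log(2)


if __name__ == "__main__":
    print(perplexity_to_nats(7.16))  # Wikitext103 perplexity expressed as nats per token
    print(nats_to_bits(1.69))        # the additive loss term expressed in bits per token
```

Note these conversions only change the base of the logarithm. They do not change the "per token" denominator, which is the part that makes the 1.69 and 0.7 figures incomparable.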
BPEs are, I think, roughly equivalent to like 3 characters or bytes
A nat-per-BPE is about 1⁄3 bits-per-byte. A BPE is thus around 4.3 (log2(7.16)/0.667 ≈ 4.26) characters. I am not 100% sure I did that right but that seems like a more sensible answer.
It is annoying that one paper uses three different units for the same thing depending on the dataset, and the base isn’t even explicit in some of them, instead of just reporting everything in bits per byte. But what are you going to do, expect people to coordinate? Ridiculous. Much better to just confuse people all the time.
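The arithmetic in the estimate above can be written out as a small sketch. The function names are hypothetical, and 0.667 is simply the bits-per-byte figure used in the comment. The idea: divide the bits per token implied by a token-level perplexity by the bits per byte measured on the same text to get bytes per token.

```python
import math


def bytes_per_token(token_perplexity: float, bits_per_byte: float) -> float:
    """Bits per token (log2 of perplexity) divided by bits per byte."""
    return math.log2(token_perplexity) / bits_per_byte


def nats_per_token_to_bits_per_byte(nats_per_token: float, bytes_per_tok: float) -> float:
    """Convert nats to bits, then spread the bits over the bytes in one token."""
    return nats_per_token / math.log(2) / bytes_per_tok


if __name__ == "__main__":
    bpt = bytes_per_token(7.16, 0.667)
    print(bpt)                                        # roughly 4.26
    print(nats_per_token_to_bits_per_byte(1.0, bpt))  # roughly 1/3
```

This only holds if the perplexity and the bits-per-byte figure describe the same model on the same text, which is an assumption here.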
I am not 100% sure I did that right but that seems like a more sensible answer.
Eyy, I should trust myself more. Verified on Pile-CC.
(GPT-2/3 BPE)
>>> k = 100000000; k / len(tokenizer(cc[:k])["input_ids"])
4.355680325470372
(T5 sentencepiece)
>>> k = 10000000; k / len(tokenizer(cc[:k])["input_ids"])
4.182535904979476
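The REPL lines above assume that `tokenizer` and `cc` (the Pile-CC text) are already loaded. A self-contained sketch of the same measurement follows, with a hypothetical `chars_per_token` helper and whitespace splitting standing in for a real tokenizer:

```python
from typing import Callable, Sequence


def chars_per_token(text: str, tokenize: Callable[[str], Sequence], k: int) -> float:
    """Average number of characters per token over the first k characters of text."""
    sample = text[:k]
    tokens = tokenize(sample)
    if not tokens:
        raise ValueError("tokenizer produced no tokens for the sample")
    # Use len(sample) rather than k so a text shorter than k is still measured correctly.
    return len(sample) / len(tokens)


if __name__ == "__main__":
    # str.split is only a stand-in tokenizer for illustration, not BPE.
    print(chars_per_token("aaaa bbbb cccc", str.split, k=14))
```

With a Hugging Face tokenizer, the callable would presumably be something like `lambda s: tokenizer(s)["input_ids"]`, matching the expression in the REPL lines. Note that `len` on a Python string counts code points rather than bytes, so the two diverge on non-ASCII text.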
Thanks Lanrian and Gwern! Alas that my quick-and-dirty method is insufficient.