This is super interesting. Are you able to share the underlying data?
I just got it from the papers and ran a linear regression, using pdftables.com to convert from PDF to Excel. I used pages 68 and 79 in the Gopher paper:
https://arxiv.org/pdf/2112.11446.pdf
Page 35 in the Chinchilla paper:
https://arxiv.org/pdf/2203.15556.pdf
Pages 79 and 80 in the PaLM paper:
https://arxiv.org/pdf/2204.02311.pdf
Thanks, though I was hoping for something like a Google Sheet containing the data.
OK, here’s a Google sheet I just threw together: https://docs.google.com/spreadsheets/d/1Y_00UcsYZeOwRuwXWD5_nQWAJp4A0aNoySW0EOhnp0Y/edit?usp=sharing
Thanks! At least for Gopher, if you look at correlations between reductions in log-error (which I think the scaling-laws literature suggests is the more natural framing), you find a much tighter relationship, particularly when looking at the relatively smaller models.
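For concreteness, this is the transformation I mean (a sketch with made-up accuracies, not the numbers from the papers):

```python
import numpy as np

# Hypothetical per-task accuracies for a smaller and a larger model (fractions in [0, 1]).
acc_small = np.array([0.30, 0.55, 0.70, 0.90])
acc_big = np.array([0.35, 0.60, 0.80, 0.95])

# Reduction in log-error: log(error_small) - log(error_big), where error = 1 - accuracy.
# A positive value means the bigger model shrank that task's error multiplicatively,
# which is the quantity scaling-law fits tend to be linear in.
log_error_reduction = np.log(1 - acc_small) - np.log(1 - acc_big)
print(np.round(log_error_reduction, 3))
```

Correlating these reductions across compute jumps, rather than raw accuracy deltas, is what tightens the relationship.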
Thanks! Is the important thing there log-error, though, or just that if the absolute performance difference between models is small enough, then the differences in task performance between the two are noise (as in parallel runs of the same model), and you do wind up reverting to the mean?
I can’t get the image to display, but here’s an example of how you get a negative correlation if your runs are random draws from the same Gaussian:
https://i.imgur.com/xhtIX8F.png
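In case the image stays broken, here's a minimal simulation of the same point (made-up draws, not the benchmark data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three independent "runs": iid draws from the same Gaussian across 2000 tasks.
a, b, c = rng.normal(size=(3, 2000))

# Successive "improvements" are then pure noise...
d1 = b - a  # apparent gain from run A to run B
d2 = c - b  # apparent gain from run B to run C

# ...and because they share the middle run with opposite signs,
# they correlate at about -0.5 even though nothing real changed.
r = np.corrcoef(d1, d2)[0, 1]
print(round(r, 2))
```

So a negative correlation between consecutive gains is exactly what regression to the mean predicts when the runs are statistically identical.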
I’m not sure what you mean; I’m not looking at log-odds. Maybe the correlation is an artefact from noise being amplified in log-space (I’m not sure), but it’s not obvious to me that this isn’t the correct way to analyse the data.
Here’s the corresponding graph for the non-logged difference, which also displays a large correlation.
Nitpick: wouldn’t this graph be much more natural with the x and y axes reversed? I’d want to input the reduction in log-error over a cheaper compute regime to predict the reduction in log-error over a more expensive one.
How much does this change when you remove the big outlier in the top left?
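A quick way to check that kind of sensitivity (sketch with made-up numbers standing in for the real points, including one top-left outlier):

```python
import numpy as np

# Illustrative data: five well-behaved points plus one top-left outlier
# (very negative x, large y), mimicking the point in question.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, -0.6])
y = np.array([0.1, 0.25, 0.3, 0.45, 0.5, 0.9])

# Correlation with everything included.
r_all = np.corrcoef(x, y)[0, 1]

# Drop the top-left point (smallest x here) and recompute.
keep = np.ones(len(x), dtype=bool)
keep[np.argmin(x)] = False
r_trimmed = np.corrcoef(x[keep], y[keep])[0, 1]

print(round(r_all, 2), round(r_trimmed, 2))
```

A single high-leverage point like that can flip the sign of the correlation, which is why the question matters.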