Thanks! At least for Gopher, if you look at correlations between reductions in log-error (which I think the scaling laws literature suggests would be the more natural framing), you find a tighter relationship, particularly when looking at the relatively smaller models.
Thanks! Is the important thing there log-error, though, or just that if the absolute performance difference between models is small enough, then the per-task differences between the two are noise (as with parallel runs of the same model), and you do wind up regressing to the mean?
I can’t get the image to display, but here’s an example of how you get a negative correlation if your runs are random draws from the same Gaussian:
https://i.imgur.com/xhtIX8F.png
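In case the image doesn't load, the effect can be reproduced with a minimal simulation. This is only a sketch under assumptions of my own and not necessarily what the linked plot shows: three hypothetical "runs" per task (labelled small, medium, large), each an independent draw from the same standard Gaussian, so there is no real improvement anywhere. The function names and the 10,000-task count are made up for illustration. Because the medium run enters the first gain with a plus sign and the second with a minus sign, the two gains should correlate at about -0.5.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_noise_only(n_tasks=10_000, seed=0):
    """Correlation between successive 'gains' when every run is pure noise.

    Hypothetical setup: each task gets three scores drawn independently from
    the same N(0, 1), standing in for three runs with no true difference.
    """
    rng = random.Random(seed)
    first_gain, second_gain = [], []
    for _ in range(n_tasks):
        small, medium, large = (rng.gauss(0.0, 1.0) for _ in range(3))
        first_gain.append(medium - small)   # "gain" from run 1 to run 2
        second_gain.append(large - medium)  # "gain" from run 2 to run 3
    return pearson(first_gain, second_gain)


if __name__ == "__main__":
    print(f"correlation between successive gains: {simulate_noise_only():.3f}")
```

The theoretical value follows from Cov(medium - small, large - medium) = -Var(medium) = -1 and a variance of 2 for each gain, giving -1/2. Real model runs would have a true signal on top of this, so the observed correlation would be pulled away from -0.5 depending on how large the real differences are relative to the noise.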
I’m not sure what you mean; I’m not looking at log-odds. Maybe the correlation is an artefact from noise being amplified in log-space (I’m not sure), but it’s not obvious to me that this isn’t the correct way to analyse the data.
Here’s the corresponding graph for the non-logged difference, which also displays a large correlation.
Nitpick: wouldn’t this graph be much more natural with the x and y axes reversed? I’d want to input the reduction in log-error over a cheaper compute regime to predict the reduction in log-error over a more expensive one.
How much does this change when you remove the big outlier in the top left?