Great collection of results. I found the interactive graph particularly useful.
I’m slightly confused by the trend lines (especially for Games and Other): they don’t seem like the intuitively best fits. It looks like they place a lot of weight on the high-parameter recent models (possibly the cost for each datapoint is computed in parameter space rather than log(parameter) space?).
Thank you! I think you are right: by default the Altair library (which we used to plot the regressions) does an OLS fit of an exponential rather than fitting a linear model over the log transform. We’ll look into this and report back.
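For reference, here is what the log-space alternative looks like: fit a line to log10(parameters) against year, so large recent models do not dominate the loss. This is a minimal sketch with made-up years and parameter counts, not the actual dataset:

```python
import numpy as np

# Hypothetical data: publication year and parameter count of some models.
years = np.array([2012, 2014, 2016, 2018, 2020])
params = np.array([6e7, 1.4e8, 3e8, 1.1e9, 1.75e11])

# Fitting a line in log space weights every datapoint equally on the
# log scale, so a 175B-parameter model does not dominate the fit.
slope, intercept = np.polyfit(years, np.log10(params), deg=1)

def predict(year):
    """Predicted parameter count under the log-linear fit."""
    return 10 ** (intercept + slope * year)
```

Fitting the exponential directly with OLS instead would minimize squared error in parameter space, which is what makes the fit chase the largest models.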
If you are still interested in fiddling with this graph, here’s a variant I’d love to see:
Remove all the datapoints in each AI category that are not record-setting, such that each category just tracks the largest available models at any given time. Then compute the best fit lines for the resulting categories. (Because this is what would be useful for predicting what the biggest models will be in year X, whereas the current design is for predicting what the average model size will be in year X… right?)
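The record-setter filter described above can be sketched with a running maximum. The dataframe columns and values here are hypothetical, just to show the idea:

```python
import pandas as pd

# Hypothetical dataframe: one row per model, with year and parameter count.
df = pd.DataFrame({
    "year":   [2012, 2013, 2015, 2016, 2018, 2019],
    "params": [6e7, 3e7, 3e8, 1e8, 1.1e9, 1.75e11],
})

# Sort chronologically, then keep only rows that set (or tie) the record:
# a model survives iff its parameter count equals the running maximum.
df = df.sort_values("year")
records = df[df["params"] == df["params"].cummax()]
```

The best-fit lines would then be computed on `records` rather than on the full dataset.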
Good suggestion! Understanding the trend among record-setting models would indeed be interesting, since it avoids the pesky influence of below-trend systems like CURL in the games domain.
The problem with the naive setup of just regressing on record-setters is that it is quite sensitive to noise: one early outlier in the trend can completely alter the result.
I explore a similar problem in my paper Forecasting timelines of quantum computing, where we try to extrapolate progress on some key metrics like qubit count and gate error rate. The method we use in the paper to address this issue is to bootstrap the input and predict a range of possible growth rates—that way outliers do not completely dominate the result.
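The bootstrap idea can be sketched as follows. The data here is illustrative (made-up values, not taken from the paper): resample the datapoints with replacement, refit the log-linear trend each time, and report a percentile range of growth rates instead of a single point estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical series: year and a metric that grows roughly exponentially.
years = np.array([2014, 2015, 2016, 2017, 2018, 2019, 2020])
metric = np.array([5.0, 9.0, 20.0, 35.0, 80.0, 150.0, 300.0])

# Bootstrap: resample (year, metric) pairs with replacement and refit
# the log-linear trend each time, collecting the slope (growth rate).
slopes = []
for _ in range(1000):
    idx = rng.choice(len(years), size=len(years), replace=True)
    if len(np.unique(years[idx])) < 2:
        continue  # need at least two distinct years to fit a line
    slope, _ = np.polyfit(years[idx], np.log10(metric[idx]), deg=1)
    slopes.append(slope)

# A range of plausible growth rates; a single outlier datapoint shifts
# some resamples but cannot dominate the whole interval.
low, high = np.percentile(slopes, [5, 95])
```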
I will probably not do it right now for this dataset, though I’d be interested in having other people try that if they are so inclined!
OK, sounds good! I know someone who might be interested...
Another, very similar thing that would be good is to just delete all the non-record-setting data points and draw lines to connect the remaining dots.
Also, it would be cool if we could delete all the Mixture of Experts models to see what the “dense” version of the trend looks like.
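Filtering out the Mixture-of-Experts models becomes a one-liner once the dataset carries a flag for them. The `is_moe` column is hypothetical and the rows are illustrative:

```python
import pandas as pd

# Hypothetical dataframe with a flag marking Mixture-of-Experts models.
df = pd.DataFrame({
    "model":  ["GPT-3", "Switch-C", "Gopher", "GLaM"],
    "params": [1.75e11, 1.57e12, 2.8e11, 1.2e12],
    "is_moe": [False, True, False, True],
})

# Drop the sparse (MoE) models to look at the dense trend alone.
dense = df[~df["is_moe"]]
```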
This is now fixed; see the updated graphs. We have also updated the eyeball estimates accordingly.