tl;dr: if models unpredictably undergo rapid logistic improvement, we should expect zero correlation in aggregate.
If models unpredictably undergo SLOW logistic improvement, we should expect positive correlation. This also means getting more fine-grained data should give different correlations.
To condense and steelman the original comment slightly:
Imagine that learning curves all look like logistic curves. The following points are unpredictable:
How big of a model is necessary to enter the upward slope.
How big of a model is necessary to reach the plateau.
How good the performance at the plateau is.
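One way to make the three unknowns concrete (a hypothetical parameterization, not something from the original comment) is to write each task's score as a logistic function of log model size $\log_{10} N$, with an assumed midpoint $m$, width $s$, and plateau height $P$:

```latex
\[
  \text{score}(N) \;=\; \frac{P}{1 + \exp\!\left(-\dfrac{\log_{10} N - m}{s}\right)}
\]
```

Under this assumed form, the curve is at about 12% of $P$ when $\log_{10} N = m - 2s$ (roughly "entering the upward slope") and at about 88% of $P$ when $\log_{10} N = m + 2s$ (roughly "reaching the plateau"), so unpredictable $m$, $s$ and $P$ correspond to the three unpredictable points above.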
Would this result in zero correlation between model jumps?
So each model is in one of the following states:
1. floundering randomly
2. learning fast
3. at performance plateau
Then the possible transitions (small → 7B → 280B) are:
1->1->1: slight negative correlation due to regression to the mean
1->1->2: zero correlation since first change is random, second is always positive
1->1->3: zero correlation as above
1->2->2: positive correlation as the model is improving during both transitions
1->2->3: positive correlation as the model improves during both transitions
1->3->3: zero correlation, as the model is improving in the first transition and random in the second
2->2->2: positive correlation
2->2->3: positive correlation
2->3->3: zero correlation
3->3->3: slight negative correlation due to regression to the mean
That’s two cases of slight negative correlation, four cases of zero correlation, and four cases of positive correlation.
However, positive correlation only happens if the middle state is state 2, that is, only if the 7B model does meaningfully better than the small model AND is not already saturated.
If the logistic jump is slow (takes more than 3 orders of magnitude of model size) AND we are able to reach it with the 7B model for many tasks, then we would expect to see positive correlation.
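The case count can be checked mechanically. Here is a minimal sketch, assuming states never move backwards as models grow, and encoding the labels from the list as a simple rule (positive when the middle state is 2, slight negative when nothing changes in state 1 or 3, zero otherwise). The names `classify`, `tally` and `ALL_PATHS` are made up for illustration.

```python
from collections import Counter
from itertools import combinations_with_replacement


def classify(path):
    """Label one (small, 7B, 280B) state path with its expected correlation."""
    small, mid, large = path
    if mid == 2:
        # The 7B model is on the upward slope, so both jumps reflect learning.
        return "positive"
    if small == mid == large:
        # Stuck in state 1 or 3: both jumps are noise, so regression to the mean.
        return "slight negative"
    # One jump is real improvement, the other is noise.
    return "zero"


def tally(paths):
    """Count how many paths fall under each label."""
    return Counter(classify(p) for p in paths)


# Assumption: states are non-decreasing in model size, which gives 10 paths.
ALL_PATHS = list(combinations_with_replacement((1, 2, 3), 3))

if __name__ == "__main__":
    for path in ALL_PATHS:
        print("->".join(map(str, path)), classify(path))
    print(dict(tally(ALL_PATHS)))
```

This only counts cases; it says nothing about how often each case occurs in practice, which is what the rest of the argument turns on.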
However, if we assume that:
Size of model necessary to enter the upward slope is unpredictable
Size of a model able to saturate performance is rarely more than 100x the size of a model that starts to learn
The saturated performance level is unpredictable
Then we will rarely see a 2->2 transition, which means the actual possibilities are:
Two cases of slight negative correlation
Four cases of zero correlation
One case of positive correlation (1->2->3, which should be less common as it requires ‘hitting the target’ of state 2)
Together these should average out to around zero or a very small positive correlation, as observed.
However, more precise data with smaller gaps between model sizes would reveal patterns much more effectively, since you could establish which of the transition cases you were in.
Even so, this model still leaves progress basically “unpredictable” if you aren’t actively involved in producing the models: if you only see the public updates, you don’t have the more precise data that could reveal the correlations.
This seems like evidence for ‘fast takeoff’ style arguments: since we observe zero correlation, if the logistic form holds, that suggests that the ability to do a task at all is very near in cost to the ability to do it as well as possible.
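A toy Monte Carlo can show how the aggregate correlation between the two jumps moves with the speed of the logistic transition. Everything numeric here is an assumption chosen for illustration: the log10 model sizes in `LOG_SIZES`, the uniform ranges for the midpoint and plateau, the noise level, and the two `scale` values standing in for "fast" and "slow". It is a sketch of the argument, not a fit to any real benchmark data.

```python
import math
import random

# Assumed log10 parameter counts for the small, 7B and 280B models.
LOG_SIZES = (8.6, 9.85, 11.45)


def logistic_score(log_size, midpoint, scale, plateau):
    """Score on one task as a logistic function of log10 model size."""
    return plateau / (1.0 + math.exp(-(log_size - midpoint) / scale))


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def jump_correlation(scale, n_tasks=5000, noise_sd=0.02, seed=0):
    """Correlation across simulated tasks between the two score jumps."""
    rng = random.Random(seed)
    first_jump, second_jump = [], []
    for _ in range(n_tasks):
        # Unpredictable curve position and plateau height (assumed ranges).
        midpoint = rng.uniform(6.0, 14.0)
        plateau = rng.uniform(0.3, 1.0)
        scores = [
            logistic_score(x, midpoint, scale, plateau) + rng.gauss(0.0, noise_sd)
            for x in LOG_SIZES
        ]
        first_jump.append(scores[1] - scores[0])   # small -> 7B
        second_jump.append(scores[2] - scores[1])  # 7B -> 280B
    return pearson(first_jump, second_jump)


if __name__ == "__main__":
    print("fast transition:", round(jump_correlation(scale=0.1), 3))
    print("slow transition:", round(jump_correlation(scale=1.5), 3))
```

With these assumed ranges, the slow setting should come out clearly more positively correlated than the fast one. The exact values depend heavily on the chosen ranges, so they are best read as a direction of effect rather than a prediction.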
I think I endorse this condensation/steelman! Thank you for making it :-)
For more in this vein maybe: why forecasting S-curves is hard. The associated video is pretty great.