At the moment, to a first approximation, the power of our systems is proportional to the logarithm of their number of parameters, and again to a first approximation, we need to take a gradient step per parameter in training.
This is a bit of an unrelated aside, but I don’t think it’s so clear that “power” is logarithmic (or what power means).
One way we could try to measure this is via something like effective population. If N models with 2M parameters are as useful as kN models with M parameters, what is k? In cases where we can measure, I think realistic values tend to be >4. That is, if you had a billion models with N parameters working together in a scientific community, I think you’d get more work out of 250 million models with 2N parameters, and so greater efficiency per unit of compute.
There’s still a question of how e.g. scientific output scales with population. One way you can measure it is by asking “If N people working for 2M years are as useful as kN people working for M years, what is k?” where I think you also tend to get numbers in the ballpark of 4, though this is even harder to measure than the question about models. But I think most economists would guess this is more like root(N) than log(N).
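One illustrative way to connect these two guesses (a parameterization I’m adding for concreteness, not something from the original exchange): suppose output scales as a power of both population P and time T,

\[ \text{Output}(P, T) \propto P^{a}\, T^{b}, \]

so that N people working for 2M years match kN people working for M years exactly when

\[ N^{a} (2M)^{b} = (kN)^{a} M^{b} \quad\Longrightarrow\quad k = 2^{b/a}. \]

A value of k ≈ 4 then corresponds to b/a ≈ 2, e.g. a = 1/2 (the root(N) guess for population) and b = 1; the same template, with parameters per model in place of years, gives the analogous k for the question about models.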
That still leaves the question of how scientific output scales with time spent thinking. In this case it seems more like an arbitrary choice of units for measuring “scientific output.” E.g. I think there’s a real sense in which each improvement to semiconductors takes exponentially more effort than the one before. But the upshot of all of that is that if you spend 2x as many years, we expect to be able to build computers that are >10x more efficient. So it’s only really logarithmic if you measure “years of input” on a linear scale but “efficiency of output” on a logarithmic scale. Other domains beyond semiconductors grow less explosively quickly, but seem to have qualitatively similar behavior. See e.g. “Are Ideas Getting Harder to Find?”
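To spell out the arithmetic behind the “2x the years, >10x the efficiency” claim with an illustrative model I’m adding (roughly in the spirit of Moore’s law, not something stated above): suppose efficiency doubles every τ years,

\[ E(t) = E_0 \cdot 2^{t/\tau}, \qquad \frac{E(2t)}{E(t)} = 2^{t/\tau} > 10 \ \text{ once } \ t > \tau \log_2 10 \approx 3.3\,\tau. \]

On a linear efficiency axis this is exponential growth in calendar time; it only looks logarithmic if, as noted above, years go on a linear scale and efficiency on a logarithmic one.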
Quick comment (not sure it’s related to any broader points): total compute for N models with 2M parameters is roughly 4NM^2 (since per Chinchilla, the number of training steps scales linearly with model size, and the number of floating point operations per step also scales linearly; see also my calculations here). So an equal total compute cost would correspond to k=4.
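As a sanity check on that arithmetic, here is a minimal sketch. The 6-FLOPs-per-parameter-per-token and ~20-tokens-per-parameter constants are standard Chinchilla-style assumptions I’m adding; they cancel in the ratio, so the conclusion only depends on training compute scaling quadratically with parameter count.

```python
# Sketch of the compute comparison: N models with 2M parameters vs. kN models with M parameters.
# Assumptions (added, not from the original text): compute-optimal tokens ~ 20 * params,
# and training FLOPs ~ 6 * params * tokens, so training compute scales as params^2.

def train_flops(params: float) -> float:
    """Approximate compute-optimal training FLOPs for one model."""
    tokens = 20 * params          # Chinchilla-style tokens-per-parameter ratio (assumed ~20)
    return 6 * params * tokens    # ~6 FLOPs per parameter per token

M = 1e9    # hypothetical parameter count
N = 1_000  # hypothetical number of models

big = N * train_flops(2 * M)    # N models with 2M parameters
small = N * train_flops(M)      # N models with M parameters

print(big / small)  # -> 4.0, so an equal total compute budget corresponds to k = 4
```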
What I was thinking when I said “power” is that in most BIG-Bench scaling plots, if you put some measure of performance (e.g. accuracy) on the y axis, it seems to scale linearly or polynomially in the log of the number of parameters, and indeed I believe the graphs in that paper usually have log parameters on the x axis. It does seem that when we start to saturate performance (error tends to zero), the power laws kick in, and it’s more like an inverse polynomial in the total number of parameters than in their log.
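To make the two regimes concrete, here is a toy sketch; the coefficients and exponents are made up purely for illustration and are not fit to BIG-Bench data. Far from the ceiling, accuracy grows roughly linearly in log(parameters); near saturation, the error instead decays like an inverse power of the parameter count.

```python
import numpy as np

params = np.logspace(7, 11, 9)  # hypothetical model sizes, 10M to 100B parameters

# Regime 1: far from the ceiling, accuracy grows roughly linearly in log(params)
acc_loglinear = 0.1 + 0.08 * np.log10(params / 1e7)

# Regime 2: near saturation, error decays like an inverse power of params
err_powerlaw = (params / 1e7) ** -0.3
acc_saturating = 1.0 - err_powerlaw

for p, a1, a2 in zip(params, acc_loglinear, acc_saturating):
    print(f"{p:.0e}  log-linear acc {a1:.2f}   power-law acc {a2:.2f}")
```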