Efficiency and resource use scaling parity
An interesting pattern I’ve noticed across many different domains is that if we try to attribute improvements in outcomes or performance in a domain to the two categories “we’re using more resources now than in the past” and “we’re making more efficient use of resources now than in the past”, the improvement usually splits roughly evenly between the two categories.
Some examples:
In computer vision, Erdil and Besiroglu (2022) (my own paper) estimates that 40% of performance improvements from 2012 to 2022 have been due to better algorithms, and 60% due to the scaling of compute and data.
In computer chess, a similar pattern seems to hold: roughly half of the progress in chess engine performance from Deep Blue to 2015 came from the scaling of compute, and half from better algorithms. Stockfish 8 running on 1997 consumer hardware could achieve an Elo rating of around 3000, compared to around 2500 for software contemporary with that hardware, and Stockfish 8 on 2015 hardware could reach around 3400 (a rough version of this arithmetic is sketched in the snippet after this list).
In rapidly growing economies, accounting for growth in output per worker by decomposing it into contributions from capital per worker (resource scaling) and TFP (efficiency scaling, roughly speaking) often gives an even split: see Bosworth and Collins (2008) for data on China and India specifically.
More pessimistic estimates of China’s growth performance relative to official data put this split at 75% to 25% (see this post for details), but the two effects are still at least comparable.
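To make the chess example concrete, here is a minimal back-of-the-envelope sketch in Python. It uses the approximate Elo figures quoted above and assumes that Elo gains can be attributed additively to hardware and software, which is a simplification of mine rather than anything taken from the underlying sources.

```python
# Rough attribution of chess engine progress to hardware vs. software,
# using the approximate Elo figures quoted above. Assumes the total Elo
# gain splits additively between the two sources (a simplification).

elo_1997_software_1997_hardware = 2500   # engines contemporary with the hardware
elo_stockfish8_1997_hardware = 3000      # modern software, old hardware
elo_stockfish8_2015_hardware = 3400      # modern software, modern hardware

software_gain = elo_stockfish8_1997_hardware - elo_1997_software_1997_hardware  # ~500
hardware_gain = elo_stockfish8_2015_hardware - elo_stockfish8_1997_hardware     # ~400
total_gain = software_gain + hardware_gain

print(f"software share: {software_gain / total_gain:.0%}")  # ~56%
print(f"hardware share: {hardware_gain / total_gain:.0%}")  # ~44%
```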
A toy model
A speculative explanation is the following: if we imagine that performance in some domain is measured by a multiplicative index $P$ which can be decomposed as the product of individual contributing factors $F_1, F_2, \ldots, F_n$, so that $P \propto \prod_{i=1}^n F_i$, then in general we’ll have

$$ g_P = \frac{1}{P} \frac{dP}{dt} = \sum_{i=1}^n \frac{1}{F_i} \frac{dF_i}{dt} = \sum_{i=1}^n g_{F_i} $$

thanks to the product rule. Here $g_X$ denotes the growth rate of the variable $X$.
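For completeness, here is the product-rule step spelled out: take logarithms and differentiate with respect to time.

```latex
% Taking logs of P \propto \prod_i F_i and differentiating with respect to t:
\ln P = \text{const} + \sum_{i=1}^n \ln F_i
\quad \Longrightarrow \quad
g_P = \frac{d}{dt} \ln P = \sum_{i=1}^n \frac{d}{dt} \ln F_i = \sum_{i=1}^n g_{F_i}.
```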
I now want to use a law of motion from Jones (1995) for the $F_i$: we assume they evolve over time according to

$$ g_{F_i} = \frac{1}{F_i} \frac{dF_i}{dt} = \theta_i F_i^{-\beta_i} I_i^{\lambda_i} $$

where $\theta_i, \beta_i, \lambda_i > 0$ are parameters and $I_i$ is a measure of “investment input” into factor $i$. This general specification can capture diminishing returns to investment as we make progress or scale up resources, thanks to $\beta_i$, and returns to scale in spending more resources on investment at a given time, thanks to $\lambda_i$.
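As a quick illustration of what $\beta_i$ and $\lambda_i$ do, here is a minimal numerical sketch (with made-up parameter values, not estimates from any source cited here) that integrates the law of motion for a single factor at two constant investment levels: the growth rate decays as $F$ accumulates (the $\beta$ effect), and doubling investment scales the instantaneous growth rate by $2^{\lambda}$ at any given $F$ (the $\lambda$ effect).

```python
import math

# Jones-style law of motion for a single factor:
#   g_F = (1/F) dF/dt = theta * F**(-beta) * I**lam
# Illustrative parameters; constant investment levels I = 1 and I = 2.

theta, beta, lam = 1.0, 1.0, 0.5
dt, T = 0.01, 100.0
steps = int(T / dt)

def growth_path(I):
    """Integrate the law of motion with constant investment I."""
    F, rates = 1.0, []
    for _ in range(steps):
        gF = theta * F ** (-beta) * I ** lam
        rates.append(gF)
        F += dt * gF * F       # Euler step on dF/dt = gF * F
    return rates

low, high = growth_path(1.0), growth_path(2.0)

# Diminishing returns: g_F falls as F accumulates.
print(f"I=1: g_F falls from {low[0]:.2f} to {low[-1]:.4f} over the run")
# Returns to scale: at the start (same F), doubling I scales g_F by 2**lam.
print(f"at t=0, g_F(I=2)/g_F(I=1) = {high[0] / low[0]:.2f} (2**lam = {2**lam:.2f})")
```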
Substituting this into the growth expression for $P$ gives

$$ g_P = \frac{1}{P} \frac{dP}{dt} = \sum_{i=1}^n \theta_i F_i^{-\beta_i} I_i^{\lambda_i} $$
Now, suppose that at any given time we have a total budget $I$ to allocate across the investments $I_i$, and that this budget grows over time at a rate $g$. To maximize the rate of progress at a given time, the marginal returns to investment across all factors should be equal, i.e. we should have

$$ \frac{\partial}{\partial I_i} \left( \frac{1}{F_i} \frac{dF_i}{dt} \right) = \frac{\partial}{\partial I_j} \left( \frac{1}{F_j} \frac{dF_j}{dt} \right) $$

for all pairs $i, j$. Substituting the law of motion and differentiating gives

$$ \theta_i \lambda_i F_i^{-\beta_i} I_i^{\lambda_i - 1} = \theta_j \lambda_j F_j^{-\beta_j} I_j^{\lambda_j - 1} $$
and upon simplification, we recover

$$ \frac{\lambda_i g_{F_i}}{I_i} = \frac{\lambda_j g_{F_j}}{I_j} $$
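The simplification step just uses the law of motion to rewrite the marginal return in terms of the growth rate itself:

```latex
% Using the law of motion g_{F_i} = \theta_i F_i^{-\beta_i} I_i^{\lambda_i}
% to rewrite the marginal return to investment in factor i:
\theta_i \lambda_i F_i^{-\beta_i} I_i^{\lambda_i - 1}
  = \frac{\lambda_i}{I_i} \left( \theta_i F_i^{-\beta_i} I_i^{\lambda_i} \right)
  = \frac{\lambda_i \, g_{F_i}}{I_i},
\qquad \text{and similarly for } j.
```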
In an equilibrium where all quantities grow exponentially, the ratios $I_i / I_j$ must therefore remain constant, i.e. all of the $I_i$ must grow at the aggregate rate of input growth $g$. Then it’s easy to see that the Jones law of motion implies $g_{F_i} = g \lambda_i / \beta_i$ for each factor $i$ (take logs of the law of motion and differentiate: for $g_{F_i}$ to stay constant we need $-\beta_i g_{F_i} + \lambda_i g = 0$), from which we get the important conclusion

$$ g_{F_i} \propto \frac{\lambda_i}{\beta_i} = r_i $$

that must hold in an exponential growth equilibrium. The parameter $r_i$ is often called the returns to investment, so this relation says that distinct factors account for growth in $P$ in proportion to their returns to investment parameters.
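As a sanity check on this equilibrium result, here is a minimal simulation (again with made-up parameter values) of two factors whose investment streams each grow at the aggregate rate $g$; the constant budget shares are dropped since they only shift levels, not asymptotic growth rates. The simulated growth rates settle near $g \lambda_i / \beta_i$, so the factors’ contributions to overall growth come out roughly proportional to $r_i = \lambda_i / \beta_i$.

```python
import math

# Two factors following the Jones-style law of motion
#   g_F = theta * F**(-beta) * I**lam,
# each receiving an investment stream growing at the aggregate rate g.
# Parameter values are illustrative, not estimates from the cited data.

g = 0.10
dt, T = 0.01, 400.0
steps = int(T / dt)

# (theta, beta, lam); predicted equilibrium growth rate is g * lam / beta
factor_params = [(1.0, 1.0, 0.5),   # r = lam/beta = 0.5
                 (1.0, 1.0, 1.0)]   # r = lam/beta = 1.0

for theta, beta, lam in factor_params:
    F = 1.0
    for k in range(steps):
        I = math.exp(g * k * dt)           # investment grows at rate g
        gF = theta * F ** (-beta) * I ** lam
        F += dt * gF * F                   # Euler step on dF/dt = gF * F
    predicted = g * lam / beta
    print(f"beta={beta}, lam={lam}: simulated g_F ~ {gF:.3f}, "
          f"predicted g*lam/beta = {predicted:.3f}")
```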
How do we interpret the data in light of the toy model?
If we simplify the setup and make it about two factors, one measuring resource use and the other measuring efficiency, then the fact that the two factors account for comparable fractions of overall progress should mean that their associated return parameters $r_i$ are close. This prediction is validated in some domains:
A linearly accumulating factor that follows $dF/dt \propto I$ has $r = 1$, and if this factor is raised to a power $\alpha$, its returns become $r = \alpha$ (a short derivation follows this list).
The capital share of the US economy is around 30%, which roughly means that total output scales with the capital stock raised to a power of 0.3. Since capital is thought to accumulate linearly, the parity hypothesis predicts $r = 0.3$ for US total factor productivity, and Bloom et al. (2020) validates this prediction almost exactly.
When we focus on niche problems, we must be careful, because some efficiency terms (e.g. Moore’s law) are driven by worldwide investment that’s not optimized for our niche application. Such a term is better treated as exogenous progress, and it’s then more informative to compare the endogenous terms with each other instead.
In machine learning, compute scaling has been much faster than Moore’s law, so we expect our model to be a good approximation in this case. If we assume spending on compute accumulates linearly, so that $r_{\text{compute}} = 1$, the parity results suggest we should also expect $r_{\text{software}} \approx 1$. I have some unpublished estimates that support something like this, but a public reference is Tom Davidson’s takeoff model, where he estimates $r \approx 2.5$ in computer vision. Not quite 1, but not too far off either.
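Here is a short derivation of the claim in the first bullet above, just mapping a linearly accumulating factor onto the Jones law of motion; the capital-share example then follows by taking $\alpha \approx 0.3$.

```latex
% A linearly accumulating factor: dF/dt = c I, so
g_F = \frac{1}{F} \frac{dF}{dt} = c \, F^{-1} I^{1}
\quad \Longrightarrow \quad
\beta = 1, \; \lambda = 1, \; r = \frac{\lambda}{\beta} = 1.
% If the factor entering P is G = F^{\alpha}, then g_G = \alpha g_F, and
g_G = \alpha c \, F^{-1} I = \alpha c \, G^{-1/\alpha} I^{1}
\quad \Longrightarrow \quad
\beta = \frac{1}{\alpha}, \; \lambda = 1, \; r = \frac{\lambda}{\beta} = \alpha.
```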
I think the toy model generally suggests that $r_{\text{efficiency}}$ tends to be $O(1)$. We have no good a priori reason to expect this, but in fact it seems to be the case. From one perspective, this is really another way of phrasing the parity hypothesis, but I think it’s one that might be useful in thinking about the phenomenon.
Of course, there are other possible explanations for this “coincidence”, but none of them seem completely satisfactory, which is rather disappointing.