Revisiting the Horizon Length Hypothesis
Summary: As part of my work at Epoch, I investigated the horizon length hypothesis—the idea that the horizon length of a task is predictive of the training compute needed to learn that task. My current (weak) conclusion is that the horizon length hypothesis can’t be used in practice to estimate the compute requirements for training transformative AI because of the difficulty of 1) measuring horizon length accurately and 2) accounting for compute-saving techniques like curriculum learning. The evidence is weak, so we decided to publish a summary of our research so far here for public scrutiny and gather feedback while we decide whether we want to work more on the topic.
Introduction
The horizon length hypothesis (HLH) asserts that the amount of compute needed to train an ML system to perform a task is $C \propto H \cdot N^{1+\alpha}$, where $N$ is the model size (number of parameters), $H$ is the horizon length of the associated task, and $\alpha$ is an exponent that is common to all tasks (equivalently, the amount of training data required scales as $D \propto H \cdot N^{\alpha}$, so that $C \propto N \cdot D$).
The horizon length is intuitively the length of the feedback loops involved in learning the task. For example, balancing a pole with your hand has a subsecond feedback loop, whereas running a company has months- or years-long feedback loops. So we should expect the horizon length of the task ‘Running a company’ to be up to 8 orders of magnitude bigger than that of ‘Balancing a pole’.
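To make this concrete, here is a minimal sketch of the arithmetic implied by the formula above, assuming $\alpha = 1$; the model size and the two horizon lengths are illustrative placeholders, not estimates from the report.

```python
# Illustrative HLH arithmetic. All numbers are placeholder assumptions,
# not estimates from the report.

def hlh_compute(n_params: float, horizon_seconds: float, alpha: float = 1.0) -> float:
    """Training compute implied by the HLH, up to an unknown constant: C ~ H * N^(1 + alpha)."""
    return horizon_seconds * n_params ** (1 + alpha)

N = 1e12  # hypothetical model size (parameters)

pole = hlh_compute(N, horizon_seconds=0.1)     # sub-second feedback loop
company = hlh_compute(N, horizon_seconds=3e7)  # ~1-year feedback loop

print(f"compute ratio: {company / pole:.1e}")  # ~3e8, i.e. roughly 8 orders of magnitude
```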
Dependence on the training method
Instead of directly training a model to perform a task, one can use transfer learning from a different task, or prepare a curriculum of progressively harder instances of the task. Using these techniques it’s possible to reduce the amount of task-specific data used in training.
Therefore, using these techniques we might be able to save compute by training only on small amounts of data for the long-horizon task and doing the bulk of the training on short-horizon tasks. So if we naively apply the HLH with the full horizon length of a long-horizon task, we will overestimate its training requirements.
However, the HLH might be valid for fairly direct training methods (no transfer, no reward shaping, etc.). If this is true, it might still be informative for predicting compute requirements. In particular, it seems likely that models will need at least some training at longer horizon lengths. An example of this might be LLMs, where pretraining has a horizon length of 1, but RLHF arguably has a longer horizon.
In this way, training can be decomposed into several phases (or several reward components), each with a defined horizon length. The compute required for each phase could then be estimated with the HLH. I have not tested this approach yet but I think there are relatively simple experiments that could be informative.[1]
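As a rough illustration of what such a decomposition could look like, here is a minimal sketch that applies the HLH phase by phase to an LLM-style pipeline: pretraining at horizon 1 plus a hypothetical long-horizon phase that, thanks to transfer, only needs a small fraction of the from-scratch data budget. Every number here is a made-up placeholder; the point is only the structure of the estimate.

```python
# Minimal sketch of a phase-by-phase HLH estimate. All values are
# hypothetical placeholders chosen for illustration only.

ALPHA = 1.0  # assumed data/parameter exponent (Chinchilla-like)
N = 1e12     # hypothetical model size (parameters)

# Each phase: horizon length and the fraction of the "direct training"
# data budget (N^ALPHA feedback loops) that the phase actually needs,
# e.g. thanks to transfer from earlier phases.
phases = {
    "pretraining (next-token)": {"horizon": 1.0, "data_fraction": 1.0},
    "long-horizon RL phase":    {"horizon": 1e4, "data_fraction": 1e-3},
}

def phase_compute(horizon: float, data_fraction: float) -> float:
    feedback_loops = data_fraction * N ** ALPHA
    cost_per_loop = N * horizon  # ~N FLOP per step, `horizon` steps per feedback loop
    return feedback_loops * cost_per_loop

total = sum(phase_compute(**p) for p in phases.values())
naive = 1e4 * N ** (1 + ALPHA)  # applying the HLH with the long horizon to all of training

print(f"decomposed estimate: ~{total:.1e}, naive long-horizon estimate: ~{naive:.1e}")
```

Under these arbitrary numbers the decomposed estimate comes out roughly three orders of magnitude below the naive one, which is exactly the kind of gap discussed in the previous section.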
Evidence for the HLH
We have some theoretical reasons to believe that model size should be scaled proportionally to the amount of data (that is, $\alpha = 1$, so that $D \propto H \cdot N$). These come from bounds in statistical learning theory,[2] from some models of scaling laws, and from the fact that for some restricted classes of neural networks and functions it does seem to be true.[3] Separately, in reinforcement learning it is known that tasks with longer time horizons require proportionally more samples.[4] There are also other theoretical analyses that give different results though.
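To spell out one of these arguments (the one in footnote 3, under my reading that the 1/2 exponents describe how the loss falls with parameters and data): a minimal derivation, assuming a Chinchilla-style additive loss and a compute budget proportional to parameters times data.

```latex
% Assume L(N, D) = A N^{-1/2} + B D^{-1/2} with a fixed budget N D = C'.
\[ L(N, D) = A N^{-1/2} + B D^{-1/2}, \qquad N D = C' . \]
% Substitute D = C'/N and set dL/dN = 0:
\[ -\tfrac{A}{2} N^{-3/2} + \tfrac{B}{2} C'^{-1/2} N^{-1/2} = 0
   \;\Rightarrow\; C'^{1/2} = \tfrac{B}{A}\, N
   \;\Rightarrow\; D^{*} = \frac{C'}{N} = \Big(\frac{B}{A}\Big)^{2} N . \]
% The compute-optimal amount of data grows in proportion to the number of
% parameters, i.e. a data/parameter scaling exponent of 1.
```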
Meanwhile, the exponents measured in the scaling literature are often very different (Figure 1). I don’t think we should take this as a definitive rebuttal of the HLH, because these measurements are quite noisy. In particular, we have reasons to suspect that the experiments in Henighan et al. 2020 are not truly compute-optimal (since this turned out to be the case for their language results). The most careful measurement of this exponent at scale for language was Chinchilla, which did give a value of 1.
The fact that a lot of these exponents are probably wrong, even though the experiments were performed by world-class teams, indicates that in practice measuring these scaling parameters, and hence the horizon length of a task, is quite hard and requires careful experimentation. This makes me skeptical about using the HLH for forecasting.
Conclusion
The large savings in training requirements from transfer learning, curriculum learning and similar techniques mean that naively applying the HLH won’t yield good estimates of training compute requirements for long-horizon tasks.
It’s unclear whether the HLH is true even for more direct training methods. The empirical evidence is inconclusive, while the theoretical evidence seems weakly positive. If it is true, we might be able to use it to estimate training requirements by decomposing training into different components with well-defined horizon lengths.
One additional complication is that accurately measuring the horizon length of a task empirically seems quite hard. Therefore, we would need to rely on heuristics and informal arguments about the nature of the task to guess the horizon length.
Taking all of this into account, I think that using the HLH to make quantitative predictions of training compute requirements is quite hard. However, the HLH might still provide useful qualitative insights.
You can read the full report here, but note that it’s quite rough and unpolished.
- ^
This would involve taking some concrete task and trying several training methods, or possibly combinations of several reward functions with different horizon lengths.
- ^
Note that the applicability of classical learning theory to deep learning is quite contested; see for example this sequence.
- ^
E.g., for perceptrons (single-layer networks) it is known to be true. Also, there are several results that show data and parameter scaling exponents of 1/2 (e.g., for two-layer networks, or for SGD on convex functions). If this turns out to be the best possible scaling in general, it would imply an optimal scaling exponent of 1.
- ^
This is mainly mediated by the variance of the gradient estimates, which increases with the time horizon; see this. Reducing this variance requires using proportionally more samples.
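To unpack this slightly (this is my gloss of the argument, not the exact statement in the reference):

```latex
% If the single-trajectory gradient estimate \hat{g} accumulates roughly
% independent per-step noise over a horizon of H steps, then
\[ \operatorname{Var}(\hat{g}) \approx \sigma^{2} H . \]
% Averaging n independent trajectories gives
\[ \operatorname{Var}\Big( \tfrac{1}{n} \textstyle\sum_{i=1}^{n} \hat{g}_{i} \Big) \approx \frac{\sigma^{2} H}{n} , \]
% so holding the error of the gradient estimate at a fixed level requires
% n \propto H: the number of samples grows proportionally with the horizon.
```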
To check my understanding: Your graph + argument is that we should be fairly uncertain about what the relevant scaling laws will be for AGI, and that it could be anywhere from (say) 0.3 to 1.6. How does this translate into variance in timelines? Well, IIRC Ajeya has 1e35 FLOP as her median for training requirements, and something like 1e16 of that comes from flop-per-subjective-second, and maybe 1e5 from multiple-subjective-seconds-per-feedback-loop/data-point, so that leaves 1e14 for data points / feedback-loops? Which is about as many as you have parameters, consistent with Ajeya’s guess at the scaling laws where the exponent is 0.8.
So if instead you had an exponent of 0.3, data would be cut in half (on a log scale) to something like 1e7? And if you had an exponent of 1.6, data would be 60%-100% more, to something like 1e24?
So, you conclude, there’s such a huge variance in what our timelines should be (like, 15 OOMs on the key variable of AGI training requirements) based on such flimsy evidence, that we should look for a better way to estimate timelines than this.
Am I understanding correctly? (This is all mental math, maybe I’m doing it wrong?)
Not quite. What you said is a reasonable argument, but the graph is noisy enough, and the theoretical arguments convincing enough, that I still assign >50% credence that data (number of feedback loops) should be proportional to parameters (exponent=1).
My argument is that even if the exponent is 1, the coefficient corresponding to horizon length (‘1e5 from multiple-subjective-seconds-per-feedback-loop’, as you said) is hard to estimate.
There are two ways of estimating this factor:
1. Empirically fitting scaling laws for whatever task we care about.
2. Reasoning about the nature of the task and how long the feedback loops are.
Number 1 requires a lot of experimentation, choosing the right training method, hyperparameter tuning, etc. Even OpenAI made some mistakes on those experiments. So probably only a handful of entities can accurately measure this coefficient today, and only for known training methods!
Number 2, if done naively, probably overestimates training requirements. When someone learns to run a company, a lot of the relevant feedback loops probably happen on timescales much shorter than months or years. But we don’t know how to perform this decomposition of long-horizon tasks into sets of shorter-horizon tasks, how important each of the subtasks is, etc.
We can still use the bioanchors approach: pick a broad distribution over horizon lengths (short, medium, long). My argument is that outperforming bioanchors by making more refined estimates of horizon length seems too hard in practice to be worth the effort, and maybe we should lean towards shorter horizons being more relevant (because so far we have seen a lot of reduction from longer-horizon tasks to shorter-horizon learning problems, eg expert iteration or LLM pretraining).
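To illustrate what that coarse approach looks like, here is a minimal sketch that puts a probability distribution over a few horizon-length buckets and propagates it to a training-compute estimate. The buckets, probabilities, and the baseline compute number are all hypothetical placeholders, not the estimates from bioanchors or from the report.

```python
# Minimal sketch of a bioanchors-style estimate: a coarse distribution over
# horizon-length buckets, propagated to training-compute requirements.
# All numbers below are hypothetical placeholders for illustration.
import math

BASE_COMPUTE = 1e30  # assumed requirement at horizon length 1 (FLOP, placeholder)

# Horizon-length buckets (effective seconds per feedback loop) and made-up probabilities.
horizon_buckets = {
    "short (~1e0 s)":  (1e0, 0.4),
    "medium (~1e3 s)": (1e3, 0.4),
    "long (~1e6 s)":   (1e6, 0.2),
}

expected_log_flop = 0.0
for name, (horizon, prob) in horizon_buckets.items():
    compute = BASE_COMPUTE * horizon  # HLH: compute scales linearly with H
    expected_log_flop += prob * math.log10(compute)
    print(f"{name}: p={prob:.1f}, ~1e{math.log10(compute):.0f} FLOP")

print(f"probability-weighted (log-space) estimate: ~1e{expected_log_flop:.1f} FLOP")
```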
OK, I think we are on the same page then. Thanks.
Assuming I’m understanding correctly:
Nice argument. I guess I have a bit more confidence in the scaling laws than you. However, I definitely still agree that our uncertainty about AGI 2023 training compute requirements should range over many OOMs.
However, what does this have to do with horizon length? I guess the idea is that the proper scaling law shouldn’t be assumed to be a function of data points alone, but rather of data points and what type of task you are training on, and plausibly for longer-horizon tasks you need less data (especially with techniques like imitation learning + finetuning, etc.)? Yep, that also seems very plausible to me; it’s a big part of why my timelines are much shorter than Ajeya’s.