“Data wall” is not about running out of data, at least if we are targeting a level of capabilities rather than cheaper inference with overtrained models. In the sense of literally running out of data, there is no data wall, because data can be repeated in training, and with Chinchilla scaling, increasing the amount of data (including repetition) by some factor requires increasing compute by the square of that factor. This makes even publicly available data sufficient for $100 billion training runs. We are going to run out of compute/funding before we run out of data (absent autonomous long-horizon agents, but then the resulting faster algorithmic progress makes straightforward scaling of LLMs less relevant).
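To make the quadratic relationship concrete, here is a back-of-envelope sketch in Python. The 20-tokens-per-parameter ratio and the effective-FLOPs-per-dollar figure are illustrative assumptions, not claims from the argument above.

```python
# Back-of-envelope Chinchilla arithmetic; all constants here are
# illustrative assumptions rather than figures from the comment above.
#
# Chinchilla-optimal training: compute C ≈ 6 * N * D FLOPs, with roughly
# D ≈ 20 tokens per parameter, i.e. N = D / 20, hence C ≈ 0.3 * D**2 and
# the required data grows only as the square root of compute.

def tokens_needed(flops: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla-optimal token count for a given FLOP budget."""
    # C = 6 * (D / tokens_per_param) * D  =>  D = sqrt(C * tokens_per_param / 6)
    return (flops * tokens_per_param / 6.0) ** 0.5

# Hypothetical $100 billion run: assume ~7e17 effective FLOPs per dollar
# (a rough guess bundling GPU price, FLOP/s, and utilization).
flops_per_dollar = 7e17
budget_dollars = 100e9
compute = budget_dollars * flops_per_dollar      # ~7e28 FLOPs
data = tokens_needed(compute)                    # ~5e14 tokens

print(f"compute: {compute:.1e} FLOPs, optimal data: {data:.1e} tokens")
# Quadratic scaling: 100x the compute needs only 10x the data.
print(f"100x compute -> {tokens_needed(100 * compute) / data:.0f}x data")
```

Under these assumptions the data requirement is on the order of hundreds of trillions of tokens, and a 100x larger compute budget only needs 10x more data, which is why repetition over a fixed public corpus can keep up with scaling.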
The reason repeated data is rarely heard about is that there is still enough unique data at the current scale. And until recently the prevailing wisdom apparently was that repeating data is really bad for LLMs, possibly until the Galactica paper (see Figure 6). But where data is scarcer, like the code data in StarCoder 2, this technique is already in use (they train for up to 5 epochs). Repeating data about 4 times is about as good as having 4 times more unique data, and repeating about 16 times is still highly useful (see Figure 5 in the above paper on scaling laws for repeated data).
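As a rough sketch of how diminishing returns from repetition can be modeled, the following uses a saturating-exponential form for “effective unique data”, in the spirit of the repeated-data scaling laws; the decay constant r_star here is an illustrative round number, not the paper’s exact fitted value.

```python
import math

def effective_unique_data(unique: float, epochs: float, r_star: float = 15.0) -> float:
    """Effective unique data after training for `epochs` passes over `unique` tokens.

    Saturating-exponential form in the spirit of the repeated-data scaling laws;
    r_star (the scale on which extra repetitions stop helping) is a rough
    illustrative constant, not the paper's exact fitted value.
    """
    repeats = max(epochs - 1.0, 0.0)  # passes beyond the first
    return unique * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

unique = 1.0  # one unit of unique data
for epochs in (1, 4, 16, 60):
    seen = unique * epochs
    eff = effective_unique_data(unique, epochs)
    print(f"{epochs:>2} epochs: {seen:>4.0f}x tokens seen, ~{eff:.1f}x effective unique data")
```

With these illustrative numbers, 4 epochs behave like roughly 3.7x unique data, 16 epochs like roughly 10x, and further repetition saturates, matching the qualitative picture above.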
So concerns about the “data wall” are more about getting better training quality per unit of data, whether natural or synthetic. It’s about compute efficiency rather than sample efficiency with respect to external data, in which case such techniques would be useful even if there were unlimited unique data, as they should allow training smarter systems with less compute.