If the model suggested in An Information-Theoretic Analysis of In-Context Learning is roughly right (task-inference ‘error decays in both the number of training sequences and sequence lengths’, and in particular linearly and multiplicatively in both terms), then it might also be useful to [pre]fill [long] context windows with examples, for even less task ambiguity. This might also allow a smaller (and less costly) train/fine-tune dataset for the same task-inference error. In particular, it might be very hard to be competitive by pre-training on purely synthetic data, and this kind of approach could come in handy when at least some non-synthetic data is still used.
And/or, that technique might be very useful for AIs generating/editing the synthetic data.
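To make the trade-off concrete, here is a toy numeric sketch of that multiplicative model (this is my reading of the paper's claim, not code or constants from it): if error ≈ C / (number of sequences × sequence length), then holding the product constant holds the error constant, so a longer [pre]filled context buys a proportionally smaller training set.

```python
import math

# Toy multiplicative model of task-inference error, assuming (my reading of
# the paper, not a quoted result) error ~ C / (M * T), where M is the number
# of training sequences and T is the (effective) sequence length.

def task_inference_error(num_sequences: int, seq_len: int, c: float = 1.0) -> float:
    return c / (num_sequences * seq_len)

def sequences_needed(target_error: float, seq_len: int, c: float = 1.0) -> int:
    """Training sequences needed to reach target_error at a given sequence length."""
    return math.ceil(c / (target_error * seq_len))

# Doubling the effective sequence length (e.g. by prefilling the context
# window with in-context examples) halves the required training set:
print(sequences_needed(1e-6, seq_len=2_000))  # 500
print(sequences_needed(1e-6, seq_len=4_000))  # 250
```

Under this model, dataset size and context length are interchangeable up to the constant C, which is exactly why prefilled examples could substitute for some of the training or fine-tuning data.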
On the subject of cost, it’s also possible that we only need 10%, or 1%, or 0.1% of the dataset to illustrate the meaning of the <AI> tag, and the majority, or great majority, of it can be in human mode. I’m fairly sure both that more will be better and that there will be diminishing returns from adding more, so if the alignment tax of doing the entire dataset is too high, it would be worth investigating how good a result we can get with a smaller proportion. I believe the Pretraining Language Models with Human Preferences paper simply processed the entire training set, but the processing it used was a lot cheaper than what I’m proposing.
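If it came to testing this, the sweep itself is straightforward to set up. A minimal sketch follows; the `tag_fraction` helper and the placeholder corpus are hypothetical stand-ins for the real (and much more expensive) per-document processing, not anything from the paper:

```python
import random

def tag_fraction(docs: list[str], ai_fraction: float, seed: int = 0) -> list[str]:
    """Wrap a random `ai_fraction` of documents in <AI>...</AI> tags,
    leaving the rest untouched (human mode). A trivial stand-in for the
    real per-document processing."""
    rng = random.Random(seed)
    n_tagged = round(len(docs) * ai_fraction)
    tagged = set(rng.sample(range(len(docs)), n_tagged))
    return [f"<AI>{doc}</AI>" if i in tagged else doc
            for i, doc in enumerate(docs)]

# Placeholder corpus; in practice this would be the pretraining dataset.
raw_docs = [f"document {i}" for i in range(10_000)]

# Sweep the candidate proportions from the text, training and evaluating one
# model per mix, to see where returns from a larger tagged fraction diminish:
for frac in (0.10, 0.01, 0.001):
    corpus = tag_fraction(raw_docs, frac)
    # ...pretrain on `corpus`, then measure tag-conditional behavior here.
```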
Another possibility is that you’d actually do better with an expensive but very high-quality 0.1% sample created by humans rather than full coverage done by AI with some human input. My suspicion is that, done right, a human-AI combination is the way to go, but a small human dataset might be better than a large, badly-AI-generated dataset.