For instance, it seems plausible that if “adding arabic numerals” and “translating words into arabic numerals” are two groups but “adding numbers written as words” is not, performance on the latter could nonetheless develop smoothly as the model gets better at the others. It would certainly be weird if performance on “adding numbers written as words” advanced in a sudden leap in this case.
I wouldn’t say this is weird. This is kind of the point of meta-learning, or ‘transfer’ in a broad sense: you train on X, and Y gets better! Or look at emergent capabilities: they don’t spike because of additional data being added (the token count is similar or identical), so it has to be because of larger models in some way transferring from other datapoints.
There also seems to be a premise running through this proposal that learning is simple and independent, in some sense, and that you are mostly just oversampling/undersampling as a throttle, as it were, to avoid spikes by throttling each task individually instead of only the global loss, which is too loose and leaves too much wiggle room because individual tasks are a minuscule fraction of the overall average ‘task’. But we have plenty of evidence that how you weight or group data would change the dynamics and capabilities quantitatively and qualitatively; the most striking recent research result implying that how you group data can change what is learned qualitatively is DM’s “Data Distributional Properties Drive Emergent In-Context Learning in Transformers”, Chan et al 2022:
Large transformer-based models are able to perform in-context few-shot learning, without being explicitly trained for it. This observation raises the question: what aspects of the training regime lead to this emergent behavior? Here, we show that this behavior is driven by the distributions of the training data itself.
In-context learning emerges when the training data exhibits particular distributional properties such as burstiness (items appear in clusters rather than being uniformly distributed over time) and having large numbers of rarely occurring classes. In-context learning also emerges more strongly when item meanings or interpretations are dynamic rather than fixed. These properties are exemplified by natural language, but are also inherent to naturalistic data in a wide range of other domains. They also depart significantly from the uniform, i.i.d. training distributions typically used for standard supervised learning.
In our initial experiments, we found that in-context learning traded off against more conventional weight-based learning, and models were unable to achieve both simultaneously. However, our later experiments uncovered that the two modes of learning could co-exist in a single model when it was trained on data following a skewed Zipfian distribution—another common property of naturalistic data, including language. In further experiments, we found that naturalistic data distributions were only able to elicit in-context learning in transformers, and not in recurrent models.
In sum, our findings indicate how the transformer architecture works together with particular properties of the training data to drive the intriguing emergent in-context learning behaviour of large language models, and how future work might encourage both in-context and in-weights learning in domains beyond language.
Here, the distribution of tasks (known image classes) affects the kind of learning of other tasks (classes): the presence of a common class or a rare class, as opposed to a middle class, skews the model as a whole, across all future classes, away from meta-learning.
I take this as implying that if you did something like extract the implicit tasks of a big Internet scrape and did the obvious thing of rebalancing classes away from a Zipfian distribution to a uniform distribution closer to something like ImageNet with its 1000 roughly equal-sized classes, you would get models which might be much more efficient to train or might have the same or lower training loss, but would have a very different set of strengths and weaknesses—possibly, in the extreme case, they might have no few-shot capability at all! (This alternative model is probably very far away in model space from the normal meta-learning one, having learned a fundamentally different approach, so I doubt any considerations of local gradients or model properties are going to be useful.) This is a more extreme version of my concern with MoEs, that using experts to solve specific problems rather than a single universal dense model will tend to sabotage learning of interesting capabilities: here, it’s not merely that MoEs seem to do slightly better on memorization-heavy benchmarks than reasoning ones, it’s that the meta-learning doesn’t happen at all!
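To make the rebalancing concrete, here is a minimal, purely illustrative sketch (made-up numbers, not anything Chan et al actually ran): compare the training stream you get when the implicit tasks are sampled with their natural Zipfian frequencies versus after rebalancing to uniform.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1000 "implicit tasks" extracted from a scrape,
# with natural frequencies following a Zipfian law (task k appears ~ 1/k).
n_tasks = 1000
zipf = 1.0 / np.arange(1, n_tasks + 1)
zipf /= zipf.sum()

n_examples = 10_000
natural = rng.choice(n_tasks, size=n_examples, p=zipf)  # raw scrape
rebalanced = rng.choice(n_tasks, size=n_examples)       # uniform, ImageNet-style

for name, stream in [("Zipfian", natural), ("uniform", rebalanced)]:
    counts = np.bincount(stream, minlength=n_tasks)
    top10_share = np.sort(counts)[::-1][:10].sum() / n_examples
    unseen = int((counts == 0).sum())
    print(f"{name}: top-10 tasks = {top10_share:.0%} of examples; "
          f"{unseen} tasks never seen")
```

Same data and the same token budget, but one stream is dominated by a few head tasks while the other spreads exposure evenly over the tail; per Chan et al, that distributional difference is exactly what decides which kind of learning you get.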
And the strangeness probably doesn’t stop there. If you trained some large model in such a manner and it was completely crippled in some respects (while presumably having more than offsetting gains elsewhere), what would happen if you then further trained it on a Zipfian dataset which hadn’t been rebalanced? I would hazard the guess that it might learn the suppressed capabilities relatively rapidly. This would be very bad for safety purposes if you thought you trained a safe model you could release publicly, say, which did all sorts of useful things but couldn’t be made to do dangerous new things; and yet all you did was create a capabilities overhang for the first person to come along to unlock by finetuning.
This is kind of the point of meta-learning, or ‘transfer’ in a broad sense: you train on X, and Y gets better!
I’m not saying that the knowledge doesn’t transfer; I’m saying it would seem weird if it transferred sharply. Specifically, if task Z is composed of performing task X and then task Y, I would expect improving X to improve Z, I would expect improving Y to improve Z, and I would expect P(Z performed correctly) to be roughly P(X performed correctly) × P(Y performed correctly), assuming the failures are roughly independent. I think that means Z will improve a bit more sharply than either X or Y, but not drastically so?
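To spell out the arithmetic I have in mind, here is a toy sketch (the logistic learning curves and the independence of the failures are both assumptions made purely for illustration):

```python
import numpy as np

# Toy learning curves: accuracy on subtasks X and Y improves smoothly
# (logistically) over the course of training.
t = np.linspace(0.0, 1.0, 11)
p_x = 1.0 / (1.0 + np.exp(-10.0 * (t - 0.5)))  # P(X performed correctly)
p_y = 1.0 / (1.0 + np.exp(-10.0 * (t - 0.5)))  # P(Y performed correctly)

# If Z = "do X, then Y" and the failures are independent,
# P(Z correct) = P(X correct) * P(Y correct).
p_z = p_x * p_y

for ti, x, z in zip(t, p_x, p_z):
    print(f"t={ti:.1f}  P(X)={x:.2f}  P(Z)={z:.2f}")
```

P(Z) lags behind and, since log P(Z) = log P(X) + log P(Y), its relative improvement rate is the sum of the subtasks’, so it does rise somewhat more steeply, but there is no discontinuous jump.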
But I could absolutely be wrong here! Real models do things undreamt of in theory.
But we have plenty of evidence that how you weight or group data would change the dynamics and capabilities quantitatively and qualitatively … it’s not merely that MoEs seem to do slightly better on memorization-heavy benchmarks than reasoning ones, it’s that the meta-learning doesn’t happen at all!
The first part is what I’m hoping for: I want it to have different dynamics and capabilities, at least at intermediate stages… it’s fine if it eventually gets to the same place.
The second part would definitely be bad, if only because it amounts to a heavy alignment tax, and if this approach incurs a large tax it’s a non-starter. Thanks for your intuition around this!
I would hazard the guess that it might learn the suppressed capabilities relatively rapidly. This would be very bad for safety purposes if you thought you trained a safe model you could release publicly, say, which did all sorts of useful things but couldn’t be made to do dangerous new things; and yet all you did was create a capabilities overhang for the first person to come along to unlock by finetuning.
That indeed seems bad. And to make sure I’ve got it right, the intuition here is that the model strongly “wants” to learn the suppressed features (because they’re very instrumental for the simple loss)? I guess the other thing that could happen is that you’ve screwed the model up too badly by training it on this grouped loss, so that those features are really far out of reach. I’m not quite sure how to think about this.
My takeaway is that to the extent this helps with safety, it’s a brittle strategy, and it has a good chance of incurring too large a performance penalty to be viable in a competitive world.