This is kind of the point of meta-learning, or ‘transfer’ in a broad sense: you train on X, and Y gets better!
I’m not saying that the knowledge doesn’t transfer; I’m saying it would seem weird if it transferred sharply. Specifically, if task Z consists of performing task X and then task Y, I would expect improving X to improve Z, I would expect improving Y to improve Z, and I would expect P(Z performed correctly) to be roughly the product P(X performed correctly) · P(Y performed correctly), assuming the two steps succeed independently. I think that means Z will improve a bit more sharply than either X or Y, but not drastically so?
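To make that concrete, here is a minimal numeric sketch (my own illustration, not anything asserted above): assume each subtask follows a unit-rate logistic learning curve in training time and the two steps succeed independently, so P(Z) = P(X)·P(Y). Under those assumptions Z’s steepest improvement rate comes out only modestly higher than X’s, which is the “a bit sharper, not drastically so” intuition.

```python
import numpy as np

# Hedged sketch: assumes X and Y succeed independently and each follows a
# unit-rate logistic learning curve in "training time" t (both assumptions
# are illustrative, not taken from the discussion). Then P(Z) = P(X) * P(Y).
t = np.linspace(0.0, 10.0, 101)
p_x = 1.0 / (1.0 + np.exp(-(t - 5.0)))   # hypothetical learning curve for task X
p_y = 1.0 / (1.0 + np.exp(-(t - 5.0)))   # hypothetical learning curve for task Y
p_z = p_x * p_y                          # composed task Z = "do X, then do Y"

# Steepest improvement rate of each curve: ~0.25 for X alone vs ~0.30 for Z,
# i.e. a somewhat sharper rise, not a discontinuous jump.
max_slope_x = np.max(np.gradient(p_x, t))
max_slope_z = np.max(np.gradient(p_z, t))
print(f"max slope of P(X): {max_slope_x:.3f}")
print(f"max slope of P(Z): {max_slope_z:.3f}")
```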
But I could absolutely be wrong here! Real models do things undreamt of in theory.
But we have plenty of evidence that how you weight or group data can change the dynamics and capabilities both quantitatively and qualitatively… it’s not merely that MoEs seem to do slightly better on memorization-heavy benchmarks than on reasoning ones, it’s that the meta-learning doesn’t happen at all!
The first part is what I’m hoping for: I want it to have different dynamics and capabilities, at least at intermediate stages… it’s fine if it eventually gets to the same place.
The second part would definitely be bad, if only because it amounts to a heavy alignment tax, and if the tax is large enough the approach is a non-starter. Thanks for your intuition around this!
I would hazard a guess that it might learn the suppressed capabilities relatively rapidly under finetuning. That would be very bad for safety purposes if you thought you had trained a safe model you could release publicly, say, one which did all sorts of useful things but couldn’t be made to do dangerous new things, and yet all you had actually done was create a capabilities overhang for the first person who comes along to unlock it by finetuning.
That indeed seems bad. And to make sure I’ve got it right, the intuition here is that the model strongly “wants” to learn the suppressed features (because they’re very instrumental on the simple loss)? I guess the other thing that could happen is that you’ve screwed the model up too badly by training it on this grouped loss, so that those features are really far out of reach. I’m not quite sure how to think about this.
My takeaway is that to the extent this helps with safety, it’s a brittle strategy, and it has a good chance of incurring too-large a performance penalty to be viable in a competitive world.