Thanks for the thoughtful questions.
Regarding image models: our understanding is that strong regularization is required to split representations for MNIST autoencoding and CIFAR classification because there is a strong inductive bias towards learning features that are common to many classes of images. (In MNIST, 3s are similar to 8s, etc.; in CIFAR, similar edge detectors and the like will be learned for many classes.) Basically, our learning target is highly unnatural. With our current experimental design, I don’t expect this to change with scale, so I’m less excited about investigating the effect of model or dataset size. That said, this dynamic might change if we explored settings with class imbalance (routing only a small fraction of classes and training on the others as normal). I suspect this would reduce the need for regularization, leading to a lower alignment tax and perhaps more interesting dynamics with respect to scale. That’s an experiment we probably should have run (and still could, but we aren’t prioritizing image models right now).
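For concreteness, here is a minimal sketch of the class-imbalance variant I have in mind, assuming a simple MLP classifier. The names `ROUTED_CLASSES`, `HIDDEN`, and `SPLIT` are illustrative choices, not our actual configuration: only routed classes may update a reserved slice of the hidden layer, while every class updates the rest as normal.

```python
import torch
import torch.nn as nn

ROUTED_CLASSES = torch.tensor([3])   # e.g. route only the digit 3 (illustrative)
HIDDEN, SPLIT = 128, 32              # first 32 hidden units are reserved

class RoutedClassifier(nn.Module):
    def __init__(self, d_in=784, n_classes=10):
        super().__init__()
        self.enc = nn.Linear(d_in, HIDDEN)
        self.head = nn.Linear(HIDDEN, n_classes)

    def forward(self, x, y):
        h = torch.relu(self.enc(x))
        # Per-example, per-unit mask: 1 where gradients may flow back into enc.
        is_routed = torch.isin(y, ROUTED_CLASSES.to(y.device)).float().unsqueeze(1)
        keep = torch.ones_like(h)
        keep[:, :SPLIT] = is_routed          # reserved slice: routed classes only
        # Forward values are unchanged; only the backward pass is masked.
        h = keep * h + (1.0 - keep) * h.detach()
        return self.head(h)

model = RoutedClassifier()
x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
nn.functional.cross_entropy(model(x, y), y).backward()
# The encoder weights feeding the reserved units now have gradients
# only from examples of the routed classes.
```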
As for localization for unlearning in language models, my personal take is that the idea is there but we don’t have the method quite right yet. I think there’s a reasonable chance (say, 40%) that we change our configuration a bit and are able to get localization much more stably, and with lower alignment tax both pre- and post-ablation. (If I understand correctly, my colleagues agree that this outcome is plausible but think it’s less likely than I do.) If we aren’t able to find this methodological improvement, then I don’t see a point in scaling. However, if we find it, then I expect scaling will be relatively cheap because, while we will still need to pre-train models, we won’t need to do any more hyperparameter tuning than is usual. Of course, whatever method we land on may turn out to have middling performance. In that case, to get a signal on whether this is worth doing, we may need to investigate a realistic unlearning setting, where the model and data are larger, and the forget set is a smaller portion of the training data.
In terms of improvements we’re trying: we’re currently thinking about (a) insights we can borrow from mixture-of-experts models, and (b) whether it is better to route only via edges leaving parameters rather than edges leaving activations (the latter is what we currently do, and is far more aggressive).
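For readers following along, here is a toy illustration of the distinction; this is my own gloss with made-up shapes, not our actual implementation. Cutting gradient flow at an activation blocks everything upstream of it, whereas cutting it at a parameter only blocks the update to that parameter.

```python
import torch
import torch.nn as nn

lin = nn.Linear(16, 16, bias=False)
x = torch.randn(4, 16, requires_grad=True)

# (1) Routing on edges leaving an activation: stop-gradient on part of x.
#     This blocks gradients to everything upstream of x along those units,
#     which is the more aggressive option.
mask = torch.zeros(1, 16)
mask[:, :8] = 1.0
lin(mask * x + (1 - mask) * x.detach()).sum().backward()
print(x.grad[:, 8:].abs().sum())   # zero: no gradient reaches the masked units

# (2) Routing on edges leaving parameters: gradients still reach x, but the
#     update to a chosen block of weights is zeroed (assuming the batch is
#     homogeneous in route, so one mask covers it).
x.grad = None
lin.zero_grad()
lin(x).sum().backward()
lin.weight.grad[:, 8:] = 0.0       # this route may not update these weights
print(x.grad[:, 8:].abs().sum())   # nonzero: upstream gradients are untouched
```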
I’m not sure if any of our ambitious alignment goals can be achieved via fine-tuning. Once the model has “settled on” certain ways of representing concepts, it seems too late to do the kinds of things we want.[1] But this may just be a lack of imagination! Given that PEFT can be viewed as a special case of gradient routing, maybe there’s something there.
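For what it’s worth, here is the trivial sense in which I mean that PEFT is a special case of gradient routing (a sketch with placeholder modules, not a claim about any particular PEFT method): freezing everything except a small adapter is the degenerate routing scheme in which every data point routes its gradient to the same small parameter subset.

```python
import torch.nn as nn

# Hypothetical modules; any PEFT setup (adapters, LoRA, etc.) fits the same frame.
base_model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
adapter = nn.Linear(512, 512)

for p in base_model.parameters():
    p.requires_grad = False   # no data ever routes gradients into the base model
for p in adapter.parameters():
    p.requires_grad = True    # all data routes gradients to the adapter
```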
We (led by Jacob) tried a variety of things to get Expand, Route, Ablate to work as a fine-tuning method for unlearning. Unsurprisingly, we weren’t able to get it to work.
I agree that there are inductive biases towards sharing features and/or components. I’m not sure there’s a good study of which features are of this sort vs. which might actually benefit from being more separate[1], and I’m not sure how you would do such a study effectively for a truly broad set of features, nor whether it would necessarily be that useful anyway, so I tend to take this on vibes, since it’s pretty intuitive based on our own perception of, e.g., shapes. That said, there are plenty of categories/tasks/features that I would expect to be kinda separable beyond some point. Specifically, anything humans have already applied some sort of division of labor to, like software features vs. biology knowledge features vs. creative writing features, etc. (in the setting of natural language). Obviously, these all might share some basic grammatical or structural core features, but one layer of abstraction up it feels natural that they should be separable. All this goes to say: a good way to give gradient routing the best possible shot at success might be to try some such partitioning of features/tasks,[2] because unlike 3 and 8 we have some prior reason to believe they should indeed be rather separate. Maybe there are other sources that could motivate which features or tasks to try to route separately with minimal loss of utility (e.g., what MoE papers report as working well or not), but I haven’t thought about it too much.
One downside here is that all the examples that come to mind are in language settings, so to get reasonable utility to start with you would probably need to be in the 1B-7B model size range.
About the edges: have you tried all three combinations (route both, route one, route the other)? I think the fact that you limit to these edges is mentioned in the appendix Memory section. Surely routing on activation edges is not actually prohibitive. Worst case, you can just assign blocks to each category and it’ll basically just be an MoE (see the sketch below). This really is just mathematically equivalent to an MoE with a specific choice of architecture[3], right? One idea I had vaguely a while ago, though it seems rather complicated, is to alternate dense training with MoE-fication. In the dense training phases you train densely as usual. Then you use some clever algorithm (think: interpretability methods on steroids) to decide which parts of the network get which experts, and in the MoE-fication phase you use that algorithm to define routes/prune edges for the chosen partitioning. Then you go back and repeat. Each expert is somewhat analogous to its own network, so each new iteration splits things further and further. The goal is to get as much splitting as possible for minimal utility cost. The resulting model might be smaller/cheaper at inference time and more interpretable. I’m honestly not sure how useful this would be, but I thought it was kind of cool :P
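To make the block-assignment picture concrete, here is a minimal sketch of what I have in mind (my gloss, with made-up sizes, not anything from the paper): a hard, label-gated mixture where each category forwards through, and therefore only ever updates, its own expert block.

```python
import torch
import torch.nn as nn

class HardRoutedMoE(nn.Module):
    def __init__(self, d=64, n_categories=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
            for _ in range(n_categories)
        ])

    def forward(self, x, category):
        # `category` is a per-example integer label acting as a hard router.
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = category == i
            if sel.any():
                out[sel] = expert(x[sel])
        return out

moe = HardRoutedMoE()
x = torch.randn(8, 64)
category = torch.randint(0, 2, (8,))
moe(x, category).sum().backward()
# Each expert's gradients come only from examples of its own category.
```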
Really all I care about here is that these features don’t lose too much from being separate. With that said, I guess some features may actually benefit from being separated at training time if the train set has spurious correlations, which it probably does.
Unlearning virology while keeping cellular biology seems like the hardest possible task, ngl.
You might be weight-sharing or something.