I think unlearning could be a good fit for automated alignment research.
Unlearning could be a very general tool for addressing a lot of AI threat models. It might be possible to unlearn deception, scheming, manipulation of humans, offensive cybersecurity knowledge, and so on. I challenge you to come up with an AI safety failure story that can’t, in principle, be countered through targeted unlearning in some way, shape, or form.
Relative to some other kinds of alignment research, unlearning seems easy to automate, since you can optimize metrics for how well things have been unlearned.
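To make the automation claim concrete, here is a rough sketch (PyTorch-style, every name made up) of the kind of scalar metric such a pipeline could optimize: high loss on a "forget" set, low loss on a "retain" set.

```python
# Rough sketch, not a real pipeline: score an unlearning intervention by how
# badly the model now does on the forgotten data vs. how well it still does
# on everything else. All names here are illustrative.
import torch
import torch.nn.functional as F

def unlearning_score(model, forget_loader, retain_loader, device="cpu"):
    """Higher is better: high loss on the forget set, low loss on the retain set."""
    model.eval()

    def avg_loss(loader):
        total, count = 0.0, 0
        with torch.no_grad():
            for x, y in loader:
                logits = model(x.to(device))
                total += F.cross_entropy(logits, y.to(device), reduction="sum").item()
                count += y.numel()
        return total / max(count, 1)

    return avg_loss(forget_loader) - avg_loss(retain_loader)
```

An automated researcher could propose candidate interventions (gradient ascent on the forget set, targeted weight edits, whatever) and keep whichever maximizes this score. That tight feedback loop is what makes the direction automatable, though of course everything then hinges on the metric itself being trustworthy.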
I like this post.
The big failure story probably involves the model corrupting the metrics so thoroughly that whatever you think it unlearned wasn’t actually unlearned, or relearning the knowledge so fast that the unlearning doesn’t matter. But yes, unlearning is a very underrated direction for AI automation, because it targets so many threat models.
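To make the relearning worry concrete: one check you could run is to briefly fine-tune the supposedly-unlearned model on a small slice of the forgotten domain and watch how quickly performance comes back. A minimal sketch (PyTorch-style, all names made up):

```python
# Hedged sketch of a relearning-speed probe: if a few gradient steps on the
# forgotten domain restore performance almost immediately, the knowledge was
# probably suppressed rather than removed. Names are illustrative.
import copy
import torch
import torch.nn.functional as F

def relearning_curve(model, forget_batches, lr=1e-4, steps=20):
    probe = copy.deepcopy(model)              # leave the original weights alone
    probe.train()
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    losses = []
    for _, (x, y) in zip(range(steps), forget_batches):
        opt.zero_grad()
        loss = F.cross_entropy(probe(x), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())            # a steep early drop = fast relearning
    return losses
```

A flat curve doesn’t prove the knowledge is gone, but a steep early drop is strong evidence the unlearning was only superficial.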
It also addresses a real bottleneck (in this case, capabilities becoming dangerous enough to threaten any test), and while I wouldn’t call it the single best direction, how useful unlearning will be is still quite underrated.
Similarly, domain-limiting AIs would be quite useful for AI control.
I favor proactive approaches to unlearning that prevent the target knowledge from being acquired in the first place. E.g., with gradient routing, if you can restrict “self-awareness and knowledge of how to corrupt metrics” to a particular submodule of the network during learning, then whenever that submodule isn’t active you can be reasonably confident that the metrics aren’t currently being corrupted. (Even if that submodule sandbags and underrates its own knowledge, that should be fine as long as the devs know to be wary of it: just ablate that submodule whenever you’re measuring something that matters, regardless of whether your metrics say it knows stuff!)
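To gesture at what I mean, here is a very rough sketch of the routing-then-ablation idea (heavily simplified relative to the actual gradient routing method; every module name is made up):

```python
# Very rough illustration of "localize the scary knowledge, then ablate it":
# flagged data only updates a designated submodule (via stop-gradients), and
# that submodule can be zeroed out whenever you're measuring something that
# matters. This simplifies the real gradient routing method considerably.
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.main = nn.Linear(d, d)     # learns from ordinary data
        self.routed = nn.Linear(d, d)   # the submodule the target knowledge is routed to
        self.ablate = False             # flip on for safety-relevant measurement

    def forward(self, x, flagged=False):
        main_out = self.main(x)
        routed_out = self.routed(x)
        if self.ablate:
            routed_out = torch.zeros_like(routed_out)
        if flagged:
            # flagged batches: only the routed submodule receives gradient
            return main_out.detach() + routed_out
        # ordinary batches: gradient never flows into the routed submodule
        return main_out + routed_out.detach()
```

The parenthetical above then becomes a mechanical policy: set ablate = True during any measurement that matters, whatever the unlearning metrics claim.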
Some related thoughts here
Unlearning techniques should probably be battle-tested in low-stakes “model organism” type contexts, where metrics corruption isn’t expected.
Curious what areas you are most excited about!
I basically agree with this. On the question of what areas I’m most excited about: AI control (in a broad sense) and creating synthetic datasets for aligning successor AIs.