The big one probably has to do with a model being able to corrupt the metrics so thoroughly that the unlearning you think you achieved never actually happened, or being able to relearn the knowledge so fast that unlearning doesn’t matter.
I favor proactive approaches to unlearning which prevent the target knowledge from being acquired in the first place. E.g. for gradient routing, if you can restrict “self-awareness and knowledge of how to corrupt metrics” to a particular submodule of the network during learning, then if that submodule isn’t active, you can be reasonably confident that the metrics aren’t currently being corrupted. (Even if that submodule sandbags and underrates its own knowledge, that should be fine if the devs know to be wary of it. Just ablate that submodule whenever you’re measuring something that matters, regardless of whether your metrics say it knows stuff!)
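To make the shape of this concrete, here is a minimal sketch of the kind of setup I have in mind, assuming a PyTorch model where the target knowledge gets routed to a parallel MLP branch. The module and flag names (`RoutedBlock`, `is_target_batch`, `ablate_routed`) are illustrative, and routing via stop-gradients is just one way to implement this, not the exact recipe from the gradient routing paper.

```python
# Sketch: confine "target" knowledge to one submodule during training,
# then ablate that submodule whenever you measure anything that matters.
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    """Two parallel MLPs whose outputs are summed. `routed` is the submodule
    the target knowledge should be confined to; `shared` learns everything else."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                    nn.Linear(d_hidden, d_model))
        self.routed = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                    nn.Linear(d_hidden, d_model))
        self.ablate_routed = False  # flip on whenever measuring something that matters

    def forward(self, x: torch.Tensor, is_target_batch: bool) -> torch.Tensor:
        shared_out = self.shared(x)
        if self.ablate_routed:
            return shared_out  # routed submodule contributes nothing at eval time
        routed_out = self.routed(x)
        if is_target_batch:
            # Target data only updates the routed submodule: keep the shared
            # path's forward contribution but block gradients into it.
            shared_out = shared_out.detach()
        else:
            # Non-target data never updates the routed submodule.
            routed_out = routed_out.detach()
        return shared_out + routed_out

# Usage: train as usual, tagging batches that contain the target knowledge.
block = RoutedBlock(d_model=64, d_hidden=256)
opt = torch.optim.Adam(block.parameters(), lr=1e-3)
x, y = torch.randn(8, 64), torch.randn(8, 64)

opt.zero_grad()
loss = nn.functional.mse_loss(block(x, is_target_batch=True), y)
loss.backward()
opt.step()

# At measurement time, ablate regardless of what the metrics claim:
block.ablate_routed = True
```

The point of the `ablate_routed` flag is that the measurement procedure never has to trust the routed submodule’s self-reports; you just cut it out of the forward pass and see what the rest of the network can do.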
Some related thoughts here
Unlearning techniques should probably be battle-tested in low-stakes “model organism” type contexts, where metrics corruption isn’t expected.
Curious what areas you are most excited about!
I basically agree with this, and on this question:
My big areas of excitement are AI control (in a broad sense) and synthetic dataset creation for aligning successor AIs.