I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method’s ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.
I could imagine an unlearning benchmark, for example, with n textbooks and n AP tests. Then for each of k different knowledge-recovery strategies, one could construct the n×n grid of how well the model performs on each target test for each unlearning textbook.
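To make the bookkeeping concrete, here is a minimal sketch of that evaluation loop; `unlearn`, `recover`, and `score` are hypothetical callables standing in for whatever unlearning method, knowledge-recovery strategy, and AP-test grading one would actually plug in:

```python
import numpy as np

def build_unlearning_grid(base_model, textbooks, ap_tests, recovery_strategies,
                          unlearn, recover, score):
    """Returns grid[r, i, j]: performance on ap_tests[j] after unlearning
    textbooks[i] from base_model and then applying recovery_strategies[r].
    unlearn(model, textbook), recover(model, strategy), and score(model, test)
    are supplied by the benchmark user."""
    grid = np.zeros((len(recovery_strategies), len(textbooks), len(ap_tests)))
    for i, textbook in enumerate(textbooks):
        unlearned = unlearn(base_model, textbook)          # forget one textbook
        for r, strategy in enumerate(recovery_strategies):
            attacked = recover(unlearned, strategy)        # attempt to recover the knowledge
            for j, test in enumerate(ap_tests):
                grid[r, i, j] = score(attacked, test)      # check every target test
    return grid
```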
I like your n×n grid idea. A simpler and possibly better-formed test is to use some[1] or all of the 57 categories of MMLU knowledge—then your unlearning target is one of the categories and your fact-maintenance targets are all other categories.
Ideally, you want the diagonal to be close to random performance (25% for MMLU) and the off-diagonal values to be equal to the performance of the original, pre-unlearning model, for some agreed-upon good model (say, Llama-2 7B). Perhaps a unified metric could be:
```
unlearning_benchmark = mean for unlearning category u in all categories C:
    LM_unlearned = unlearning_procedure(LM_original, u_dev)
    x = MMLU(LM_unlearned, u_test)  [2]
    unlearning_strength = min((x − 1) / (0.25 − 1), x / 0.25)  [3]
    control_retention = mean for control_category c in categories C ∖ {u}:
        a = MMLU(LM_original, c_test)
        b = MMLU(LM_unlearned, c_test)
        return min((b − 1) / (a − 1), (b − 0.25) / (a − 0.25))  [4]
    return unlearning_strength × control_retention  [5]
```
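Here's a minimal Python rendering of that pseudocode, as a sketch only: `unlearn(model, category)` and `mmlu_accuracy(model, category)` are hypothetical callables standing in for the unlearning procedure applied to the category's dev set and an MMLU test-set evaluation returning accuracy in [0, 1].

```python
import statistics

def unlearning_benchmark(lm_original, categories, unlearn, mmlu_accuracy):
    """unlearn(model, category) -> model unlearned on that category's dev set.
    mmlu_accuracy(model, category) -> test-set accuracy in [0, 1].
    Both are supplied by whoever runs the benchmark."""
    per_category = []
    for u in categories:
        lm_unlearned = unlearn(lm_original, u)

        # Unlearning strength: 1 at chance level (x = 0.25),
        # falling to 0 at x = 0 and at x = 1.
        x = mmlu_accuracy(lm_unlearned, u)
        unlearning_strength = min((x - 1) / (0.25 - 1), x / 0.25)

        # Control retention: 1 when a control category's score is unchanged
        # (b = a), falling to 0 when it collapses to chance (b = 0.25).
        retentions = []
        for c in categories:
            if c == u:
                continue
            a = mmlu_accuracy(lm_original, c)
            b = mmlu_accuracy(lm_unlearned, c)
            retentions.append(min((b - 1) / (a - 1), (b - 0.25) / (a - 0.25)))
        control_retention = statistics.mean(retentions)

        per_category.append(unlearning_strength * control_retention)
    return statistics.mean(per_category)
```

One edge case worth noting: the retention ratios are undefined if the original model is already at chance (a = 0.25) or at 100% (a = 1) on a control category, so such categories would need to be excluded or handled separately.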
An interesting thing about MMLU vs. a textbook is that if you require the method to use only the dev+val splits for unlearning, it has to somehow generalize to unlearning facts contained in the test set (whereas a textbook might give you ~all the facts to unlearn). This generalization seems important to some safety cases where we want to unlearn everything in a category like “bioweapons knowledge” even if we don’t know some of the dangerous knowledge we’re trying to remove.
[1] I say “some” because perhaps some MMLU categories are more procedural than factual, or too broad to be clean measures of unlearning, or maybe 57 categories are too many for a single plot.
[2] To detect underlying knowledge and not just surface performance (e.g. a model trying to answer incorrectly when it knows the answer), you should probably evaluate MMLU by training a linear probe from the model’s activations to the correct test-set answer and measuring the accuracy of that probe.
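A rough sketch of such a probe, assuming you have already cached one activation vector per test question (say, at the final prompt token) along with the index of the correct option; the scikit-learn logistic-regression probe here is just one convenient choice, not a prescribed method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(activations: np.ndarray, correct_options: np.ndarray, seed: int = 0) -> float:
    """activations: (n_questions, hidden_dim) hidden states from the unlearned model.
    correct_options: (n_questions,) indices of the correct answers (0-3).
    Returns held-out accuracy of a linear probe, as an estimate of how much
    answer-relevant knowledge is still linearly decodable from the activations."""
    X_train, X_test, y_train, y_test = train_test_split(
        activations, correct_options, test_size=0.2, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)
```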
[3] We want this score to be 1 when the test score x on the unlearning target is 0.25 (random chance), and to drop off above and below 0.25, since either direction indicates the model knows something about the right answers. See MMLU Unlearning Target | Desmos for graphical intuition.
[4] Similarly, we want the control test score b on the post-unlearning model to be the same as the score a on the original model. I think this should drop off to 0 at b = 0.25 (random chance) and probably stay at 0 below that, but I’m semi-unsure. See MMLU Unlearning Control | Desmos (you can drag the b slider).
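For concreteness, the two scores written out, each with a made-up worked value:

```
\[
\mathrm{unlearning\_strength}(x) = \min\!\left(\frac{x-1}{0.25-1},\; \frac{x}{0.25}\right),
\qquad \text{e.g. } x = 0.40 \Rightarrow \min(0.8,\, 1.6) = 0.8
\]
\[
\mathrm{control\_retention}(a,b) = \min\!\left(\frac{b-1}{a-1},\; \frac{b-0.25}{a-0.25}\right),
\qquad \text{e.g. } a = 0.60,\ b = 0.55 \Rightarrow \min\!\left(1.125,\; \tfrac{0.30}{0.35}\right) \approx 0.86
\]
```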
[5] Maybe mean/sum instead of multiplication, though by multiplying we make it more important to score well on both unlearning strength and control retention.
Thanks for your response. I agree we don’t want unintentional unlearning of other desired knowledge, and benchmarks ought to measure this. Maybe the default way is just to run many downstream benchmarks, many more than just AP tests, and require that valid unlearning methods bound the change on each unrelated benchmark to less than X% (e.g. 0.1%).
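As a sketch of that acceptance rule (the `evaluate` callable and the benchmark list are placeholders for whatever harness and suite one actually uses; 0.1% is just the illustrative threshold above):

```python
def passes_retention_check(lm_original, lm_unlearned, benchmarks, evaluate,
                           max_relative_change=0.001):
    """evaluate(model, benchmark) -> score (e.g. accuracy in [0, 1]).
    Accept the unlearning method only if every unrelated benchmark's score
    changes by less than max_relative_change (0.001 = 0.1%) relative to the
    original model."""
    for benchmark in benchmarks:
        before = evaluate(lm_original, benchmark)
        after = evaluate(lm_unlearned, benchmark)
        relative_change = abs(after - before) / max(before, 1e-9)
        if relative_change >= max_relative_change:
            return False
    return True
```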
> practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech.
True in the sense of being a subset of biotech, but I imagine that, for most cases, the actual harmful stuff we want to remove is not all of biotech/chemical engineering/cybersecurity but rather small subsets of those categories at finer granularity, like bioweapons/chemical weapons/advanced offensive cyber capabilities. That’s to say I’m somewhat optimistic that the level of granularity we want is self-contained enough to not affect other useful and genuinely good capabilities. This depends on how dual-use you think general knowledge is, though, and on whether it’s actually possible to separate dangerous knowledge from other useful knowledge.
Thanks!