I’m interested in the relation between mechanistic anomaly detection and distillation. In theory, if we have a distilled model, we could use it for mechanistic anomaly detection: for each input x, we would check the degree to which the original model’s output differs from the distilled model’s output. If the difference is too great, we flag it as an anomaly and reject the output.
Let’s say you have your original model M and your distilled model m, along with some function d to quantify the difference between two outputs. If you are doing distillation, you would always just output m(x). If you are doing mechanistic anomaly detection, you output M(x) if d(M(x), m(x)) is below some threshold and you output nothing otherwise (a rough sketch of both policies follows the list below). Here, I can see four differences between distillation and mechanistic anomaly detection:
Distillation is cheaper, since you only have to run the distilled model, rather than both models.
Mechanistic anomaly detection might give you higher-quality outputs when d(M(x), m(x)) is below the threshold, since you are using the outputs from M rather than those from m.
Distillation will always return an output, whereas mechanistic anomaly detection will reject some.
Mechanistic anomaly detection operates according to a discrete threshold, whereas distillation is more continuous.
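To make the comparison concrete, here is a minimal sketch of the two policies as I’m imagining them. The models, the distance function d, and the threshold are all placeholders rather than any particular implementation:

```python
from typing import Callable, Optional

def distillation_policy(m: Callable, x) -> object:
    # Pure distillation: always return the distilled model's output.
    return m(x)

def mad_policy(M: Callable, m: Callable, d: Callable, threshold: float, x) -> Optional[object]:
    # Mechanistic anomaly detection (as defined above): run both models and
    # return M(x) only if the outputs agree closely enough; otherwise abstain.
    y_full, y_distilled = M(x), m(x)
    if d(y_full, y_distilled) < threshold:
        return y_full
    return None  # flag the input as anomalous and reject the output
```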
Overall, distillation just seems better than mechanistic anomaly detection in this case? Of course mechanistic anomaly detection could be done without a distilled model,[1] but whenever you have a distilled model, it seems beneficial to just use it rather than running mechanistic anomaly detection.
E.g. you observe that two neurons of the network always fire together and you flag it as an anomaly when they don’t.
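A toy version of that check, just to illustrate the footnote (the activation values and the firing rule are made up for the example):

```python
def cofiring_anomaly(act_a: float, act_b: float, fire_thresh: float = 0.0) -> bool:
    # The two neurons were observed to always fire together on ordinary inputs,
    # so flag an anomaly whenever exactly one of them fires on the current input.
    fires_a = act_a > fire_thresh
    fires_b = act_b > fire_thresh
    return fires_a != fires_b
```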
Under this definition of mechanistic anomaly detection, I agree pure distillation just seems better. But part of the hope of mechanistic anomaly detection is to reduce the false positive rate (and thus the alignment tax) by only flagging examples produced by different most-proximate reasons. In some sense this can be seen as increasing the safe threshold for d(M(x), m(x)), such that mechanistic anomaly detection is worth it all things considered.
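As a very rough sketch of what I mean, with the mechanistic signal stood in for by a hypothetical mech_discrepancy function over the two models’ internals:

```python
def mad_policy_mechanistic(M, m, d, mech_discrepancy, out_thresh, mech_thresh, x):
    # Hypothetical variant: only abstain when the outputs differ AND the internal
    # computations look different (a crude stand-in for "produced by a different
    # most-proximate reason"). This effectively raises the safe threshold on the
    # output distance alone, so fewer benign inputs get rejected.
    y_full, y_distilled = M(x), m(x)
    if d(y_full, y_distilled) < out_thresh or mech_discrepancy(M, m, x) < mech_thresh:
        return y_full
    return None
```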
Thanks—I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don’t quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression.