Under this definition of mechanistic anomaly detection, I agree pure distillation just seems better. But part of the hope of mechanistic anomaly detection is to reduce the false positive rate (and thus the alignment tax) by only flagging examples produced by differentmost-proximate reasons. In some sense this may be considered increasing the safe threshold for d(M(x)−m(x)), such that mechanistic anomaly detection is worth it all things considered.
Under this definition of mechanistic anomaly detection, I agree pure distillation just seems better. But part of the hope of mechanistic anomaly detection is to reduce the false positive rate (and thus the alignment tax) by only flagging examples produced by different most-proximate reasons. In some sense this may be considered increasing the safe threshold for d(M(x)−m(x)), such that mechanistic anomaly detection is worth it all things considered.