Mechanistic Anomaly Detection Research Update


Over the last few months, the EleutherAI interpretability team developed new mechanistic methods for detecting anomalous behavior in language models, building on Neel Nanda’s attribution patching technique. Unfortunately, none of these methods consistently outperform non-mechanistic baselines that look only at activations.
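For readers unfamiliar with attribution patching: the core idea is to approximate the effect of patching an activation with a first-order Taylor expansion, so per-position effects can be estimated from one clean forward-and-backward pass plus one corrupted forward pass, instead of one patched forward pass per node. Below is a minimal PyTorch sketch of that underlying technique; `model`, `layer`, and `metric` are hypothetical placeholders, and this is an illustration rather than our detectors’ actual code (see the repo linked below for that).

```python
import torch

def attribution_scores(model, layer, clean_batch, corrupt_batch, metric):
    """Approximate, for every position at `layer`, the effect of patching in
    the corrupted activation, via grad(metric) . (corrupt_act - clean_act)."""
    cache = {}
    handle = layer.register_forward_hook(
        lambda module, inputs, output: cache.update(act=output)
    )

    # One forward pass on the corrupted input, caching its activations.
    with torch.no_grad():
        model(corrupt_batch)
    corrupt_act = cache["act"]

    # One forward/backward pass on the clean input for activations and grads.
    out = model(clean_batch)
    clean_act = cache["act"]
    clean_act.retain_grad()
    metric(out).backward()  # `metric` must return a scalar
    handle.remove()

    # First-order estimate of the metric change from patching each position.
    return ((corrupt_act - clean_act) * clean_act.grad).sum(dim=-1)
```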

We find that methods which evaluate entire batches of test data achieve better anomaly detection performance than methods which consider test points one at a time. With batch-level evaluation we achieve very good performance on many, but not all, of the tasks we looked at.
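To illustrate the pointwise-versus-batch distinction, here is a hypothetical sketch using a simple Mahalanobis-distance detector on activations. The names and the specific detector are illustrative assumptions, not our exact implementation: the pointwise variant scores each test activation on its own, while the batch variant scores the mean of the batch, which averages away per-point noise and can expose a subtle but consistent shift.

```python
import torch

def fit_gaussian(trusted_acts):
    """Fit a Gaussian to activations collected on trusted (normal) data."""
    mean = trusted_acts.mean(dim=0)
    centered = trusted_acts - mean
    cov = centered.T @ centered / (len(trusted_acts) - 1)
    prec = torch.linalg.inv(cov + 1e-4 * torch.eye(cov.shape[0]))  # regularized
    return mean, prec

def pointwise_scores(test_acts, mean, prec):
    """Score each test point on its own: squared Mahalanobis distance."""
    d = test_acts - mean
    return torch.einsum("bi,ij,bj->b", d, prec, d)

def batch_score(test_acts, mean, prec):
    """Score a whole batch at once via the distance of its mean activation.
    Averaging before scoring suppresses per-point noise, so a shift that is
    small for any single point but consistent across the batch stands out."""
    d = test_acts.mean(dim=0) - mean
    return d @ prec @ d
```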

We also find that it is relatively easy to detect adversarial examples in image classifiers with off-the-shelf techniques, although we did not test whether our anomaly detectors are themselves adversarially robust.
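As a concrete, hypothetical example of what we mean by “off-the-shelf”: fit a Gaussian to a classifier’s penultimate-layer features on clean images, then flag test images whose features are unusually far away in Mahalanobis distance. The `feature_extractor` below stands in for any function mapping images to features; the threshold would be calibrated on held-out clean data.

```python
import torch

@torch.no_grad()
def flag_adversarial(feature_extractor, clean_images, test_images, threshold):
    clean = feature_extractor(clean_images)            # (n, d) clean features
    mean = clean.mean(dim=0)
    centered = clean - mean
    cov = centered.T @ centered / (len(clean) - 1)
    prec = torch.linalg.inv(cov + 1e-4 * torch.eye(cov.shape[0]))
    d = feature_extractor(test_images) - mean
    scores = torch.einsum("bi,ij,bj->b", d, prec, d)   # Mahalanobis distances
    return scores > threshold                          # True = flagged
```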

Thanks to @David Johnston and Arkajyoti Chakraborty for all their hard work on this project, as well as @Erik Jenner for fruitful discussion, ideas, and code!

Code: https://github.com/EleutherAI/cupbearer/tree/attribution_detector
