Mechanistic Anomaly Detection Research Update
Link post
Over the last few months, the EleutherAI interpretability team developed novel mechanistic methods for detecting anomalous behavior in language models, building on Neel Nanda's attribution patching technique. Unfortunately, none of these methods consistently outperform non-mechanistic baselines that look only at activations.
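For readers unfamiliar with the technique: attribution patching replaces each expensive activation-patching run with a first-order approximation, so a single forward and backward pass scores every activation at once. The sketch below illustrates the idea on a toy MLP; the model, metric, and variable names are placeholders for exposition, not the cupbearer implementation.

```python
# Minimal sketch of attribution patching (a first-order approximation to
# activation patching) on a toy two-layer MLP. Names are illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

def run_with_activation(x):
    # Forward pass that exposes the hidden (post-ReLU) activation.
    hidden = model[1](model[0](x))
    return model[2](hidden), hidden

x_clean = torch.randn(4, 8)
x_corrupt = torch.randn(4, 8)

# Clean run, keeping the graph so we can differentiate w.r.t. the activation.
out_clean, act_clean = run_with_activation(x_clean)
metric = out_clean.sum()
(grad_act,) = torch.autograd.grad(metric, act_clean)

# Corrupted run: only the activation values are needed.
with torch.no_grad():
    _, act_corrupt = run_with_activation(x_corrupt)

# First-order estimate of how patching the corrupted activation into the clean
# run would change the metric: (a_corrupt - a_clean) . d(metric)/d(a_clean).
# One backward pass scores every hidden unit, instead of one re-run per patch.
attribution = ((act_corrupt - act_clean) * grad_act).sum(dim=0)
print(attribution.shape)  # one score per hidden unit
```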
We find that we achieve better anomaly detection performance with methods that evaluate entire batches of test data rather than scoring test points one at a time. Performance is very good on many, but not all, of the tasks we looked at.
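As a rough illustration of the pointwise-versus-batch distinction, the sketch below uses a simple Mahalanobis-distance detector on synthetic activations as a stand-in; it is not necessarily one of the detectors evaluated in the post.

```python
# Pointwise vs. batch-level anomaly scoring with a Mahalanobis-distance
# detector on activations (illustrative stand-in, synthetic data).
import numpy as np

rng = np.random.default_rng(0)

# "Trusted" activations used to fit a reference distribution.
trusted = rng.normal(size=(1000, 32))
mean = trusted.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(trusted, rowvar=False) + 1e-3 * np.eye(32))

def mahalanobis_scores(acts):
    # Squared Mahalanobis distance of each activation vector to the trusted mean.
    centered = acts - mean
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

# A test batch with a subtle but consistent shift away from the trusted data.
test_batch = rng.normal(loc=0.3, size=(64, 32))

pointwise = mahalanobis_scores(test_batch)  # one score per test point
batch_score = pointwise.mean()              # one score for the whole batch

# Few individual points exceed a per-point threshold, but the batch-level
# score sits noticeably above the ~32 expected for in-distribution data.
print(f"points over threshold: {(pointwise > 50).sum()} / {len(pointwise)}")
print(f"batch score: {batch_score:.1f}")
```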
We also find that it is relatively easy to detect adversarial examples in image classifiers with off-the-shelf techniques, although we did not test whether our anomaly detectors are themselves adversarially robust.
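One example of an off-the-shelf approach is to fit a generic outlier detector, such as scikit-learn's IsolationForest, on the classifier's features for clean images and flag inputs that score as outliers. The sketch below uses synthetic stand-in features rather than a real classifier, and is not necessarily the detector we used.

```python
# Off-the-shelf outlier detection on classifier features (synthetic stand-ins).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Stand-ins for penultimate-layer features of clean vs. adversarial images;
# adversarial inputs tend to land in lower-density regions of feature space.
clean_features = rng.normal(size=(2000, 64))
adv_features = rng.normal(scale=2.0, size=(200, 64))

detector = IsolationForest(random_state=0).fit(clean_features)

# predict() returns +1 for inliers and -1 for outliers.
print("clean flagged:      ", (detector.predict(clean_features) == -1).mean())
print("adversarial flagged:", (detector.predict(adv_features) == -1).mean())
```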
Thanks to @David Johnston and Arkajyoti Chakraborty for all their hard work on this project, as well as @Erik Jenner for fruitful discussion, ideas, and code!
Code: https://github.com/EleutherAI/cupbearer/tree/attribution_detector