Over the last few months, the EleutherAI interpretability team pioneered novel mechanistic methods for detecting anomalous behavior in language models, based on Neel Nanda’s attribution patching technique. Unfortunately, none of these methods consistently outperforms non-mechanistic baselines that look only at activations.
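For readers unfamiliar with attribution patching, here is a minimal sketch of the core idea as it is usually described: approximate the effect of swapping one activation for another with a first-order Taylor expansion, so a single gradient replaces a separate forward pass per patch. The toy `metric` function, the weights `W`, and the activation values are all hypothetical stand-ins. This post does not specify how the team turned such estimates into anomaly scores, so the sketch covers only the approximation itself.

```python
import math

def attribution_patching_estimate(clean_act, patched_act, grad_at_clean):
    """First-order estimate of how a metric changes when `clean_act`
    is replaced by `patched_act`: (patched - clean) . d(metric)/d(act)."""
    return sum((p - c) * g for c, p, g in zip(clean_act, patched_act, grad_at_clean))

# Hypothetical stand-in for "the rest of the model": metric(a) = sum(tanh(W a)).
W = [[0.5, -1.0, 0.25],
     [1.5, 0.75, -0.5]]

def metric(act):
    return sum(math.tanh(sum(w * a for w, a in zip(row, act))) for row in W)

def metric_grad(act):
    # Analytic gradient: sum over rows of (1 - tanh(row . act)^2) * row.
    grad = [0.0] * len(act)
    for row in W:
        pre = sum(w * a for w, a in zip(row, act))
        slope = 1.0 - math.tanh(pre) ** 2
        for i, w in enumerate(row):
            grad[i] += slope * w
    return grad

clean = [0.2, -0.1, 0.4]       # activation on the clean input (made up)
patched = [0.21, -0.12, 0.43]  # activation we would patch in (made up)

# Exact effect requires re-running the downstream computation.
exact_effect = metric(patched) - metric(clean)
# The estimate needs only the gradient at the clean activation.
estimated_effect = attribution_patching_estimate(clean, patched, metric_grad(clean))
```

The estimate is accurate only when the patched activation is close to the clean one or the downstream computation is close to linear. In a real network the gradient would come from a backward pass rather than a hand-written derivative.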

We find that methods that evaluate entire batches of test data achieve better anomaly detection performance than methods that consider test points one at a time. We achieve very good performance on many, but not all, of the tasks we looked at.

We also find that it is relatively easy to detect adversarial examples in image classifiers with off-the-shelf techniques, although we did not test whether our anomaly detectors are themselves adversarially robust.

Thanks to @David Johnston and Arkajyoti Chakraborty for all their hard work on this project, as well as @Erik Jenner for fruitful discussion, ideas, and code!

Code: https://github.com/EleutherAI/cupbearer/tree/attribution_detector

Mechanistic Anomaly Detection Research Update

Nora Belrose and David Johnston

6 Aug 2024 10:33 UTC

