My answer to this is actually tucked into one paragraph on the 10th page of the paper: “This type of approach is valuable...reverse engineering a system”. We cite examples of papers that have used interpretability tools to generate novel adversaries, to aid in manually fine-tuning a network to induce a predictable change, or to reverse engineer a network. Here they are.
Making adversaries:
https://distill.pub/2019/activation-atlas/
https://arxiv.org/abs/2110.03605
https://arxiv.org/abs/1811.12231
https://arxiv.org/abs/2201.11114
https://arxiv.org/abs/2206.14754
https://arxiv.org/abs/2106.03805
https://arxiv.org/abs/2006.14032
https://arxiv.org/abs/2208.08831
https://arxiv.org/abs/2205.01663
Manual fine-tuning:
https://arxiv.org/abs/2202.05262
https://arxiv.org/abs/2105.04857
Reverse engineering (I’d put an asterisk on these, though, because I don’t expect methods like this to scale well to non-toy problems):
https://www.lesswrong.com/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking
https://distill.pub/2020/circuits/curve-detectors/