Simon Lermen comments on Open Source Automated Interpretability for Sparse Autoencoder Features

Simon Lermen 31 Jul 2024 7:14 UTC
3 points
I think it would be worth noting the potential risk of deceptive models creating false or misleading labels for features. In general, I think developing better and more robust automated feature labeling is an important direction.

At a recent hackathon, I worked in a group demonstrating that it is feasible to create bad labels with Bills et al.'s method: https://www.lesswrong.com/posts/PyzZ6gcB7BaGAgcQ7/deceptive-agents-can-collude-to-hide-dangerous-features-in