This is a great write-up, thanks! Has there been any follow-up from the paper’s authors?
This seems like a pretty compelling takedown to me, and one not addressed by the existing paper. (My understanding of the two WordNet experiments not discussed in the post: Figure 4 asks whether a concept can be linearly separated under whitening (yes), so the random baseline used there does not address the concerns in this post; Figure 5 shows that the whitening transformation preserves some of the WordNet cluster cosine similarity, but moreover, on the right, basically everything is orthogonal, as found in this post.)
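For intuition on why near-orthogonality alone isn’t very informative, here is a minimal numpy sketch (my own illustration, not from the paper or the post, with a hypothetical embedding dimension and number of concepts): independent random directions in high dimensions already have near-zero pairwise cosine similarity, so that is roughly what a random baseline predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096   # hypothetical embedding dimension
n = 200    # hypothetical number of "concept" directions

# Random directions, normalized to unit length
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Pairwise cosine similarities between distinct directions
cos = V @ V.T
off_diag = cos[~np.eye(n, dtype=bool)]

# Mean |cos| concentrates around sqrt(2/(pi*d)) ≈ 0.012 for d = 4096
print(f"mean |cos|: {np.abs(off_diag).mean():.3f}")
print(f"max  |cos|: {np.abs(off_diag).max():.3f}")
```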
This seems important to me, since the point of mech interp is not to become another interpretability field dominated by pretty pictures (e.g. saliency maps) that fail basic sanity checks (e.g. this paper for saliency maps). (Workshops aren’t too important, but I’m still surprised about this.)
Thanks a lot! We had an email exchange with the authors and they shared some updated results with much better random shuffling controls on the WordNet hierarchy.
They also argue that some contexts should promote the likelihood of both “sad” and “joy”: since the two concepts are causally separable, they should not be expected to be anti-correlated under their causal inner product per se. We’re still concerned about what this means for semantic steering.