One barrier to SAE circuits is that it’s currently hard to understand how attention-out SAE latents are calculated. Even if you use integrated-gradients (IG) attribution patching to identify which earlier latents are relevant to a given attention-out SAE latent, this doesn’t tell you how those latents interact inside the attention layer at all.
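As a rough illustration of what IG attribution patching computes here, the sketch below attributes a single downstream latent to a vector of upstream latent activations via integrated gradients. Everything is hypothetical: `downstream_latent` is a toy nonlinear stand-in for the real map (which would run upstream SAE activations through the attention layer and the attention-out SAE), and the latent dimension of 8 is arbitrary.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for the map from upstream SAE latent activations to one
# attention-out SAE latent (hypothetical; the real map goes through the
# attention layer and both SAEs).
W = torch.randn(8)

def downstream_latent(upstream):
    # Nonlinear, so gradients depend on where along the path they are taken.
    return torch.relu(upstream @ W) ** 2

def ig_attributions(upstream, baseline=None, steps=64):
    """Integrated-gradients attribution of each upstream latent to the
    downstream latent, approximated with a midpoint Riemann sum along the
    straight-line path from the baseline to the input."""
    if baseline is None:
        baseline = torch.zeros_like(upstream)
    total_grad = torch.zeros_like(upstream)
    for alpha in (torch.arange(steps) + 0.5) / steps:
        point = (baseline + alpha * (upstream - baseline)).requires_grad_(True)
        downstream_latent(point).backward()
        total_grad += point.grad
    return (upstream - baseline) * total_grad / steps

x = torch.rand(8)
attrs = ig_attributions(x)
```

The IG completeness property is a useful sanity check: the attributions should sum (up to the Riemann-sum approximation) to `downstream_latent(x) - downstream_latent(baseline)`. But as the paragraph above notes, a per-latent attribution vector like `attrs` is exactly the kind of output that stays silent about *how* the upstream latents combine inside the attention layer.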