Overall, I really like this post. I think it’s a cool, self-contained insight with real updates for interp. I also admire how quickly you got these results. It makes me want to hack more things, quickly, and get more cool results, quickly.
Models can be deeply understood: More fundamentally, this is further evidence that neural networks are genuinely understandable and interpretable, if we can just learn to speak their language.
I agree that this is evidence, but I have some sense of “there’s going to be low-hanging, truly-understandable circuits, and possibly a bunch of circuits we don’t understand and can’t even realize are there. And we keep doing interp work and understanding more and more of models, but often we won’t know exactly what we don’t know.” Are you sympathetic to this concern?
(Ofc you don’t need to understand a full net for interp to be amazingly useful, and other such caveats)
Also, what does “Translate by X” mean in your intervention plots?
Thanks! I also feel more optimistic now about speed research :) (I’ve tried similar experiments since, but with much less success—though there are a bunch of contingent factors around not properly hitting flow and not properly clearing time for it). I’d be excited to hear what happens if you try it! Though I should clarify that writing up the results took a month of random spare non-work time...
Re models can be deeply understood: yes, I think you raise a valid and plausible concern, and I agree that my work is not notable evidence against it. Though also, idk man, it seems basically unfalsifiable. My intuition is that there may be some threshold of “we cannot deeply interpret past this”, but no one knows where it is (and most people assumed “we cannot deeply interpret at all”, or something similar!). And every interpretability win is evidence that the boundary is further out (or non-existent).
Fuzzy intuition: It doesn’t distinguish between the boundary being far away vs non-existent, but IMO the correct prior before seeing any mech interp work at all was to have some distribution over the point where we hit a wall, and some probability on never hitting a wall. The longer we go without hitting a wall, the higher the posterior probability on never hitting a wall should be.
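To make that intuition slightly more concrete, here's a toy numerical version (the specific prior is made up purely for illustration): put some prior mass on “we hit a wall at interpretability level d” for various d, and some on “no wall”; each level we pass without hitting a wall rules out the hypotheses with walls at or below it, so the posterior on “no wall” grows.

```python
import numpy as np

# Toy prior (illustrative numbers only): 80% that we hit a wall at some
# level d in 1..10 (uniform over d), 20% that there is no wall at all.
levels = np.arange(1, 11)
p_wall_at = np.full(10, 0.8 / 10)
p_no_wall = 0.2

def posterior_no_wall(k):
    """Posterior P(no wall) after passing level k without hitting a wall.
    Hypotheses with a wall at level <= k are ruled out; the rest keep
    their prior mass and we renormalize."""
    surviving_wall_mass = p_wall_at[levels > k].sum()
    return p_no_wall / (p_no_wall + surviving_wall_mass)

for k in [0, 2, 5, 9]:
    print(k, round(posterior_no_wall(k), 3))
# 0 -> 0.2, 2 -> 0.238, 5 -> 0.333, 9 -> 0.714: the posterior on
# "no wall" rises the longer we go without hitting one.
```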
“Translate by X” is bad notation—it means “take the coordinate in the ‘mine vs theirs’ direction, and set it to -X times its original value”. It should really be “flip and scale by X” or something (it came from an initial iteration of the method).
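For concreteness, here's a minimal sketch of what that intervention does, assuming a vector `direction` for the “mine vs theirs” direction and a single activation vector `resid` (both names are hypothetical, not from the post):

```python
import numpy as np

def flip_and_scale(resid, direction, x):
    """Set the activation's coordinate along `direction` to -x times its
    original value (what the plots call "Translate by X").

    resid:     activation vector, shape (d_model,)
    direction: "mine vs theirs" direction, shape (d_model,)
    x:         scaling factor
    """
    direction = direction / np.linalg.norm(direction)  # ensure unit norm
    coord = resid @ direction                          # original coordinate
    # Remove the old component and add back (-x) times the original coordinate,
    # so the new coordinate along `direction` is exactly -x * coord.
    return resid + (-x * coord - coord) * direction

# Hypothetical usage: flip the component and double its magnitude.
resid = np.random.randn(512)
direction = np.random.randn(512)
new_resid = flip_and_scale(resid, direction, x=2.0)
```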