Daniel Tan comments on Arrakis—A toolkit to conduct, track and visualize mechanistic interpretability experiments.

Daniel Tan 17 Jul 2024 8:48 UTC
1 point
0
Really interesting! I’m a big proponent of improving the standards of infrastructure in the mech interp community.
Some questions:
- Have you used other tools like TransformerLens and NNsight and found them insufficient in some way? Your library seems to diverge fundamentally from both of those implementations (PyTorch hooks in the former case and “proxy variables” in the latter case). I’m curious about the motivating use case here.
- Do you have examples of reproducing specific mech interp analyses using your library? E.g. Neel Nanda’s Indirect Object Identification tutorial, or other simple things like doing activation patching / logit lens.
- Yash Srivastava 24 Jul 2024 7:10 UTC
  1 point
  0
  Thanks a lot for reading. To answer your questions:
  
  1. I am a regular user of TransformerLens (not so much of NNsight), and one of the things that bugged me a lot is the lack of abstractions for common operations (ablations, head composition, model surgery, etc.), so I decided to implement them myself. In terms of architecture, what I’ve planned is an outline similar to Meta’s Hydra, where you run your experiments from config files and the library does the grunt work. I’m still open to ideas and have been talking about it with people from the open-source community.
  
  2. In my docs, I have included example usage for all the tools that are working as of now (for supported models). There are examples for common attention operations (merging/ablating heads), for removing/permuting layers, and for others such as sparsity analysis and polysemantic scores. I will try to push heavier tutorials, such as the IOI one, in the near future.
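
For readers unfamiliar with the config-driven style described in the reply, here is a minimal sketch of the idea in plain Python. It is not Arrakis's or Hydra's actual API: the names `ablate_heads`, `merge_heads`, `OPERATIONS`, and `run_experiment`, the config layout, and the choice of mean-merging are all hypothetical, and per-head outputs are modeled as plain lists rather than real model activations.

```python
from typing import Callable, Dict, List, Set

# Toy stand-in for one attention layer: heads[h] is head h's output vector.
HeadOutputs = List[List[float]]


def ablate_heads(heads: HeadOutputs, indices: Set[int]) -> HeadOutputs:
    """Zero-ablate the selected heads; all other heads are copied unchanged."""
    return [
        [0.0] * len(vec) if i in indices else list(vec)
        for i, vec in enumerate(heads)
    ]


def merge_heads(heads: HeadOutputs, indices: Set[int]) -> HeadOutputs:
    """Replace each selected head with the element-wise mean of the selection.

    Mean-merging is an assumption made for illustration only.
    """
    chosen = [heads[i] for i in sorted(indices)]
    mean = [sum(column) / len(chosen) for column in zip(*chosen)]
    return [
        list(mean) if i in indices else list(vec)
        for i, vec in enumerate(heads)
    ]


# Registry mapping operation names used in a config to their implementations.
OPERATIONS: Dict[str, Callable[[HeadOutputs, Set[int]], HeadOutputs]] = {
    "ablate_heads": ablate_heads,
    "merge_heads": merge_heads,
}


def run_experiment(config: dict, heads: HeadOutputs) -> HeadOutputs:
    """Apply each configured step in order and return the modified outputs."""
    for step in config["steps"]:
        operation = OPERATIONS[step["op"]]
        heads = operation(heads, set(step["heads"]))
    return heads


# Hypothetical experiment config; in practice this could be loaded from a file.
config = {
    "name": "ablate-head-1",
    "steps": [{"op": "ablate_heads", "heads": [1]}],
}

heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = run_experiment(config, heads)
print(result)
```

The point of the registry is that adding a new intervention only requires registering one function, while the experiment itself stays a declarative list of steps.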