I really like your ambitious MI section and I think you hit on a few interesting questions I’ve come across elsewhere:
Two researchers interpreted a 1-layer transformer network, and then I interpreted it differently; there isn't a great way to compare our explanations, or even to tell how similar or different they really are.
With papers like the Hydra Effect demonstrating that similar knowledge can be spread throughout a network, it's not clear whether or how we should analyze the impact of individual components: can or should we jointly ablate multiple units across different heads at once? (A rough sketch of the mechanics is included below.)
I’m personally unsure how to split my time between interpreting small networks vs larger ones. Should I focus 100% on interpreting 1-2 layer TinyStories LMs or is looking into 16+ layer LLMs valuable at this time?
I don’t have a good answer here, unfortunately. My guess, as I say above, is that the most important thing is to push forward on the quality of explanations rather than on the size of the models being explained.
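For what it's worth, the mechanics of a joint ablation are easy to set up even though the interpretation question stays hard. Below is a minimal sketch of zero-ablating several attention heads in a single forward pass; it assumes TransformerLens, and the model, prompt, and (layer, head) choices are purely illustrative rather than taken from anything discussed above.

```python
# Minimal sketch: jointly zero-ablating several attention heads at once
# with TransformerLens. The specific heads here are hypothetical examples.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # any small model works
prompt = "The Eiffel Tower is located in the city of"
tokens = model.to_tokens(prompt)

# Hypothetical set of (layer, head) pairs to ablate together.
heads_to_ablate = [(9, 6), (9, 9), (10, 0)]

def make_zero_ablation_hook(head_indices):
    # "z" is the per-head attention output: [batch, pos, head_index, d_head].
    def hook(z, hook):
        z[:, :, head_indices, :] = 0.0
        return z
    return hook

# Group heads by layer so each layer gets one hook that zeroes all of its
# targeted heads simultaneously (a joint ablation, not one-at-a-time).
fwd_hooks = []
for layer in {l for l, _ in heads_to_ablate}:
    head_indices = [h for l, h in heads_to_ablate if l == layer]
    fwd_hooks.append((utils.get_act_name("z", layer),
                      make_zero_ablation_hook(head_indices)))

clean_loss = model(tokens, return_type="loss")
ablated_loss = model.run_with_hooks(tokens, return_type="loss", fwd_hooks=fwd_hooks)
print(f"clean loss: {clean_loss.item():.4f}, jointly ablated loss: {ablated_loss.item():.4f}")
```

The point of ablating the heads in one forward pass is that you measure their combined effect, which (per the Hydra-effect worry about redundancy and self-repair) can look very different from summing the effects of ablating each head on its own.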