By mech interp I mean “A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding.”
For examples of non-mech interp model internals, see here, here, and here. (Though all of these methods are quite simple.)
What are these? I’m confused about the boundary between mech interp and other approaches.