By mech interp I mean “A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding.”
For examples of non-mech interp model internals, see here, here, and here. (Though all of these methods are quite simple.)