ryan_greenblatt – By mech interp I mean “A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding.”
That makes sense to me, and I think it is essential that we identify those low-level components. But I’ve got problems with the “working upward” part.
The low-level components of a gothic cathedral, for example, consist of things like stone blocks, wooden beams, metal hinges and clasps and so forth, pieces of colored glass for the windows, tiles for the roof, and so forth. How do you work upward from a pile of that stuff, even if neatly organized and thoroughly catalogues, how do you get from there to the overall design of the overall cathedral. How, for example, can you look at that and conclude, “this thing’s going to have flying buttresses to support the roof?”
Somewhere in “How the Mind Works” Steven Pinker makes the same point in explaining reverse engineering. Imagine you’re in an antique shop, he suggests, and you come across odd little metal contraption. It doesn’t make any sense at all. The shop keeper sees your bewilderment and offers, “That’s an olive pitter.” Now that contraption makes sense. You know what it’s supposed to do.
How are you going to make sense of those things you find under the hood unless you have some idea of what they’re supposed to do?
That makes sense to me, and I think it is essential that we identify those low-level components. But I’ve got problems with the “working upward” part.
The low-level components of a gothic cathedral, for example, consist of things like stone blocks, wooden beams, metal hinges and clasps and so forth, pieces of colored glass for the windows, tiles for the roof, and so forth. How do you work upward from a pile of that stuff, even if neatly organized and thoroughly catalogues, how do you get from there to the overall design of the overall cathedral. How, for example, can you look at that and conclude, “this thing’s going to have flying buttresses to support the roof?”
Somewhere in “How the Mind Works” Steven Pinker makes the same point in explaining reverse engineering. Imagine you’re in an antique shop, he suggests, and you come across odd little metal contraption. It doesn’t make any sense at all. The shop keeper sees your bewilderment and offers, “That’s an olive pitter.” Now that contraption makes sense. You know what it’s supposed to do.
How are you going to make sense of those things you find under the hood unless you have some idea of what they’re supposed to do?