I guess my biggest doubt is whether a DL-based AI could run interpretability on itself. Large NNs seem to “simulate” a larger network to represent more features, which leaves most of the weights encoding features in superposition. I don’t see how a network could reflect on itself, since that would seem to require an even larger network (which would then require an even larger network, and so on). I don’t see how it could eat its own tail, since interpreting only parts of the network wouldn’t be enough. It would have to interpret the whole thing.
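To make the superposition point a bit more concrete, here’s a rough toy sketch (my own, in NumPy, not taken from any particular paper): squeeze more “features” than dimensions into one vector by giving each feature a random, nearly-orthogonal direction, then try to read them back. You can still recover them approximately, but every readout picks up interference from all the other features, which is the messiness that makes interpreting the whole network at once look so hard to me.

```python
import numpy as np

# Toy sketch of superposition (my own illustration, not from any specific paper):
# store more "features" than dimensions by giving each feature a random,
# nearly-orthogonal direction, then read them back with dot products.
rng = np.random.default_rng(0)
n_features, n_dims = 100, 20                      # 100 features squeezed into 20 dims
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Activate a sparse subset of features and superpose them into one hidden vector.
active = rng.choice(n_features, size=3, replace=False)
hidden = directions[active].sum(axis=0)

# Reading a feature back = dot product with its direction. The active features
# usually stand out, but every inactive feature also gets a nonzero readout
# (interference), since the directions are only *nearly* orthogonal.
readout = directions @ hidden
print("active features:", sorted(active))
print("largest readouts:", np.argsort(-readout)[:3])
```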
Uhm, by interpretability I mean things like this, where the algorithm the NN implements is reverse engineered and written down as code or whatever, which would allow for easier recursive self-improvement (by improving just the code and getting rid of the spaghetti NN).
Also, by the looks of things (induction heads and circuits in general) there does seem to be a sort of modularity in how NNs learn, so it does seem likely that you can interpret them piece by piece. If this weren’t true, I don’t think mechanistic interpretability as a field would even exist.
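And just to illustrate what I mean by “written down as code”: the induction-head pattern people usually describe ([A][B] … [A] → predict [B]) can be sketched in a few lines. This is my own toy version of the described behavior, not a claim about what the real circuit computes:

```python
def induction_copy(tokens, current):
    """Toy sketch of the induction-head behavior: if the current token
    appeared earlier in the sequence, predict the token that followed it
    the last time it appeared ([A][B] ... [A] -> [B])."""
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards for the last match
        if tokens[i] == current:
            return tokens[i + 1]
    return None                                # no earlier occurrence found

# "The cat sat. The cat ..." -> predicts "sat."
print(induction_copy(["The", "cat", "sat.", "The"], "cat"))
```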