Some say mechanistic interpretability seems really unlikely to bear any fruit, because of one or both of the following:
1. Neural networks are fundamentally impossible (or just very, very hard) for humans to understand, in the way neuroscience or economics are hard. Fully understanding complex systems is not a thing humans can do.
2. Neural networks are not doing anything you would want to understand: mostly shallow pattern matching and absurd statistical correlations, and it seems impossible to explain how that pattern matching is happening or to tease apart what the root causes of those correlations are.
I think 1 is wrong, because mechanistic interpretability has very fast feedback loops, and we are able to run a shit-ton of experiments. Humans are empirically great at understanding even the most complex systems when they can run a shit-ton of experiments.
For 2, I think the claim that neural networks do shallow pattern matching and lean on absurd statistical correlations is true, and may continue to be true for really scary systems, but I'm still optimistic we'll be able to understand why a network uses the correlations it does. We have access to the causal system which produced the network (the gradient descent process), and it doesn't seem too far a step to go from understanding raw networks to tracing the parts you don't quite understand, or suspect of shallow pattern matching, back to the gradients which built them in, and from there to the datapoints which produced those gradients. A rough sketch of what that tracing could look like is below.
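To make that concrete, here is a minimal sketch (my own illustration, not something from the post): given one parameter slice you've flagged as doing shallow pattern matching, rank training examples by how strongly their gradients push on that slice. The model, data, and the choice of flagged unit are all toy/hypothetical, and a real version would aggregate this over checkpoints from the actual training run rather than a single snapshot.

```python
# Sketch: trace a suspicious parameter slice back to the training examples
# whose gradients shape it most. Everything here is a toy stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy classifier and fake "training data".
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
x_train = torch.randn(100, 10)
y_train = torch.randint(0, 2, (100,))

# Suppose circuit analysis flagged one hidden unit as shallow pattern matching:
# its incoming weights are row 3 of the first Linear layer (arbitrary choice).
target_param = model[0].weight  # shape (16, 10)
unit_idx = 3

# Per-example attribution: gradient of each example's loss w.r.t. the flagged
# weights. Examples with large gradient norms are the ones doing the most to
# shape that unit at the current parameters.
scores = []
for i in range(len(x_train)):
    model.zero_grad()
    loss = F.cross_entropy(model(x_train[i:i+1]), y_train[i:i+1])
    loss.backward()
    scores.append(target_param.grad[unit_idx].norm().item())

top = sorted(range(len(scores)), key=lambda i: -scores[i])[:5]
print("examples with the largest pull on the flagged unit:", top)
```

With the actual gradient history (or checkpoints, TracIn-style), the same loop would tell you which datapoints originally built the correlation in, not just which ones reinforce it now.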