Imagine all possible programs that implement a particular functionality, and imagine we have a neural network that implements it. If we had perfect mechanistic interpretability, we could extract the algorithm the network uses. But what kind of program do we get? There may be multiple qualitatively different algorithms that all implement the functionality, some of which are much easier for a human to understand. The algorithm the neural network finds might not be the one that is easiest for a human to understand.
Seems clearly true: the Fourier multiplication algorithm for modular addition is not the program I would find easiest to understand for performing modular addition!
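To make the contrast concrete, here is a minimal sketch (the names and code are illustrative, not the network's actual weights or the exact construction from the grokking work) of two programs that compute the same function: the obvious one, and one in the rough shape of the Fourier algorithm, which scores every candidate answer with cosines and never "adds" in the familiar sense.

```python
import numpy as np

p = 113  # modulus; 113 is the prime used in the grokking experiments

def naive_mod_add(a: int, b: int) -> int:
    """The program a human would most likely write."""
    return (a + b) % p

def fourier_mod_add(a: int, b: int) -> int:
    """Same function, Fourier-style: score each candidate answer c by
    sum_k cos(2*pi*k*(a + b - c) / p). The sum of cosines over all
    frequencies k is large only when a + b - c is a multiple of p,
    so the argmax is (a + b) mod p. (The real network uses only a few
    key frequencies; using all p of them keeps the sketch simple.)"""
    c = np.arange(p)                      # candidate answers
    k = np.arange(p).reshape(-1, 1)       # frequencies
    logits = np.cos(2 * np.pi * k * (a + b - c) / p).sum(axis=0)
    return int(np.argmax(logits))

# Both programs agree everywhere, but only one is easy to read off.
assert all(naive_mod_add(a, b) == fourier_mod_add(a, b)
           for a in range(0, p, 5) for b in range(0, p, 9))
```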
Do you think this might be a significant obstacle in the future? For example, do you think it is likely that the algorithms inside an AGI neural network built by SGD will be so complicated, simply because of their sheer size, that they are not humanly understandable? I am especially thinking about the case where an equally capable but understandable algorithm exists.
This seems more likely if we end up with an AGI neural network that mushes together the world model and the algorithms that use it (e.g. to update it, or to plan with it), such that there are no clear boundaries between them. If the AGI is really good at manipulating the world, it probably has a pretty good model of the world, and since the world contains a lot of algorithmic information, that model will be complex. If the system is mushed together, we might need to understand all of that complexity, which could be intractable.
I expect that if you can build a system where the world model is factored out into its own module, the complexity of the world becomes easier to handle, because we can infer properties of the world model from the algorithms that construct and use it. The world model itself will still be very complex, but the algorithms that construct and use it will be simple, so inferring properties of the world model from these simple algorithms might still be tractable.
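As a purely illustrative toy sketch (none of these names or interfaces come from the discussion, and a real system would not be this simple), the hope is roughly that the world model can be an arbitrarily large learned object while the code that constructs and queries it stays short enough to read:

```python
from typing import Dict, List, Optional, Tuple

State, Action = str, str
# The world model itself may be huge and hard to inspect directly...
WorldModel = Dict[Tuple[State, Action], State]

def update(model: WorldModel, s: State, a: Action, s_next: State) -> None:
    """...but the rule that constructs it is short and inspectable."""
    model[(s, a)] = s_next

def plan(model: WorldModel, s: State, goal: State,
         actions: List[Action]) -> Optional[Action]:
    """So is the rule that uses it: pick any action the model predicts
    reaches the goal in one step."""
    for a in actions:
        if model.get((s, a)) == goal:
            return a
    return None
```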
Do you think this problem is likely to show up in the future?
Upon reflection, I’m unsure what you mean by the program being simpler. What is your preferred way to represent modular addition? I could of course write down 20 % 11, and I know exactly what that means. But that is not an algorithm: it refers to the concept of modular arithmetic without specifying how to compute it. And understanding the concept at a high level is of course easier than holding the entire algorithm in my mind at once.

I guess the usual way to compute a mod b would be to take the number a and subtract b from it repeatedly until what is left is smaller than b; what is left is then the result. Ok, that seems simpler, so never mind.
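For concreteness, a minimal sketch of that repeated-subtraction procedure (the function name is made up for illustration):

```python
def mod_by_subtraction(a: int, b: int) -> int:
    """Compute a mod b (for a >= 0, b > 0) by subtracting b until less than b."""
    while a >= b:
        a -= b
    return a

assert mod_by_subtraction(20, 11) == 20 % 11 == 9
```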
It does seem like an important distinction: the way we represent a concept versus the actual computation that produces that concept’s results. I got confused because I was conflating the two.
I feel sceptical about interpretability primarily for this reason: imagine you have a neural network that does useful superintelligent things because it “cares about humans”. We found the Fourier transform in the modular addition network because we already knew what a Fourier transform is. But we have a veeeery limited understanding, mathematically, of what “caring about humans” is.
IIRC @Nora Belrose is either studying this right now or would know who is.