“Can we control the thought and behavior patterns of a powerful mind at all?”-I do not see why we could not. For example, if we can find a cluster of problematic neurons in a neural network, then we can remove those neurons. With that being said, I do not know how well this works in practice. If we remove the neurons (and normalize so that the remaining neurons are given higher weights) but do not retrain the network, it could exhibit more unexpected or poor behavior; if we do retrain it, the network could regrow the problematic neurons. Furthermore, if we continually remove problematic neuron clusters, the network could become less interpretable: the process of detecting and removing such clusters is a selective pressure that pushes neurons either to behave well or to behave poorly while evading detection. One solution may be to employ several different detection techniques so that problematic clusters have a harder time evading all of them. Of course, some problematic clusters may still evade detection, but they will probably be much less effective at behaving problematically, since they would need to trade performance for the ability to evade detection. For example, detection might catch large problematic clusters while small ones slip through, and those small clusters would be less effective and less worrisome simply because smaller clusters have a more difficult time causing problems.
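For concreteness, here is a minimal sketch (my own illustration, not something from the comment) of what “remove the cluster and normalize the remaining neurons” could look like in a toy PyTorch MLP. The flagged indices are hypothetical stand-ins for whatever a detection method would return:

```python
# A minimal sketch, assuming we already have indices of "problematic" hidden
# neurons from some detection method. We silence those neurons and rescale the
# outgoing weights of the survivors so the next layer sees roughly the same
# overall input magnitude.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

flagged = torch.tensor([3, 7, 21])  # hypothetical problematic hidden units

with torch.no_grad():
    first, _, second = model
    keep = torch.ones(32, dtype=torch.bool)
    keep[flagged] = False

    # Silence the flagged neurons by zeroing their incoming weights and biases.
    first.weight[flagged] = 0.0
    first.bias[flagged] = 0.0

    # "Normalize": scale up the outgoing weights of the remaining neurons so
    # the next layer's total input stays roughly the same size as before.
    scale = 32 / keep.sum()
    second.weight[:, keep] *= scale
    second.weight[:, flagged] = 0.0

x = torch.randn(1, 16)
print(model(x))  # the network still runs, but off-distribution behavior may shift
```

Even in this toy setting the two failure modes above are visible: without retraining, the rescaled network may behave oddly off-distribution, and with retraining nothing stops equivalent circuits from reforming.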
There is a causal relationship between time on LW and frequency of paragraph breaks :P
Anyhow, I broadly agree with this comment, but I’d say it’s also an illustration of why interpretability has diminishing returns and we really need to also be doing “positive alignment.” If you just define some bad behaviors and ablate neurons associated with those bad behaviors (or do other things like filter the AI’s output), this can make your AI safer but with ~exponentially diminishing returns on the selection pressure you apply.
What we’d also like to be doing is defining good behaviors and helping the AI develop novel capabilities to pursue those good behaviors. This is trickier because maybe you can’t just throw the internet at self-supervised learning to do it, so it has more bits that look like the “classic” alignment problem.
I agree that black box alignment research (where we do not look at what the hidden layers are doing) is crucial for AI and AGI safety.
I just personally am more interested in interpretability than direct alignment because I think I am currently better at making interpretable machine learning models and interpretability tools, and because I can make my observations rigorous enough to convince anyone willing to replicate my experiments or read my proofs. This may have more to do with my area of expertise than with any objective difference in importance between interpretability and black box alignment.
Can you elaborate on what you mean by ‘exponentially diminishing returns’? I don’t think I fully get that or why that may be the case.
If you start with an AI that makes decisions of middling quality, how well can you get it to make high-quality decisions by ablating neurons associated with bad decisions? This is the central thing I expect to have diminishing returns (though it’s related to some other uses of interpretability that might also have diminishing returns).
If you take a predictive model of chess games trained on human play, it’s probably not too hard to get it to play near the 90th percentile of the dataset. But it’s not going to play as well as Stockfish almost no matter what you do. The AI is somewhat flexible, especially along dimensions where the training data has prediction-relevant variation, but it’s not arbitrarily flexible, and once you’ve changed the few most important neurons, the remaining neurons matter progressively less. I expect this to show up for all sorts of properties (e.g. the moral quality of decisions), not just chess skill.
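To put a rough shape on that intuition, here is a toy sketch (my own framing, assuming the “quality” of a sampled behavior is roughly Gaussian) of how much pure selection buys you: keeping only the best of n samples from a fixed distribution improves expected quality only like sqrt(log n), so each extra bit of selection pressure (each doubling of n) is worth less than the one before.

```python
# Toy illustration, under the assumption that sampled behavior quality is
# standard normal: expected quality of the best of n samples grows very
# slowly in n, so returns per bit of selection pressure shrink.
import numpy as np

rng = np.random.default_rng(0)
trials = 1000  # Monte Carlo repetitions per setting

for bits in [1, 2, 4, 8, 12]:
    n = 2 ** bits  # "bits" bits of selection = keep the best of n samples
    samples = rng.standard_normal((trials, n))
    best = samples.max(axis=1).mean()
    print(f"{bits:2d} bits of selection -> best sample ~ {best:.2f} sd above the mean")
```

The same curve shape applies if you think of ablation or output filtering as discarding some fraction of the model’s behaviors: the first few bits of selection help a lot, and then the gains flatten out.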