“Can we control the thought and behavior patterns of a powerful mind at all?” I do not see why this would not be the case. For example, in a neural network, if we can find a cluster of problematic neurons, then we can remove those neurons. That being said, I do not know how well this works in practice. After removing the neurons (and normalizing so that the remaining neurons are given higher weights), if we do not retrain the network, it could exhibit more unexpected or poor behavior; if we do retrain it, it could regrow the problematic neurons. Furthermore, if we continually remove problematic neuron clusters, the network could become less interpretable: the process of detecting and removing such clusters is a selective pressure that pushes neurons to either behave well or behave poorly while evading detection. One solution may be to employ several different detection techniques, so that it is harder for problematic clusters to evade all of them. Of course, some problematic clusters may still evade detection, but they will probably be much less effective at behaving problematically, since they would need to trade performance for the ability to evade detection. For example, the detection process might catch large problematic clusters while small ones slip through; in that case, the small clusters would be less effective and less worrisome simply because smaller clusters have a more difficult time causing problems.
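A minimal sketch of the remove-and-renormalize step described above, in PyTorch. The network, the layer indices, and the flagged neuron indices are all hypothetical; a real pipeline would pair this with an actual detection method and possibly retraining.

```python
import torch
import torch.nn as nn

# Toy two-layer network; the sizes are arbitrary and only for illustration.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

def ablate_and_renormalize(model, hidden_idx, downstream_idx, bad_neurons):
    """Zero out a cluster of hidden neurons and rescale the survivors.

    `bad_neurons` lists hidden-unit indices flagged as problematic by some
    detection method (not shown here).
    """
    hidden = model[hidden_idx]          # nn.Linear producing the hidden units
    downstream = model[downstream_idx]  # nn.Linear consuming them (after the ReLU)
    with torch.no_grad():
        # Remove the cluster: zero its incoming weights and bias.
        hidden.weight[bad_neurons, :] = 0.0
        hidden.bias[bad_neurons] = 0.0
        # "Normalize": scale up the downstream weights of the surviving
        # neurons so the layer's total outgoing weight mass is roughly kept.
        keep = [i for i in range(hidden.out_features) if i not in set(bad_neurons)]
        scale = hidden.out_features / max(len(keep), 1)
        downstream.weight[:, keep] *= scale

ablate_and_renormalize(model, hidden_idx=0, downstream_idx=2, bad_neurons=[3, 17, 42])
```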
There is a causal relationship between time on LW and frequency of paragraph breaks :P
Anyhow, I broadly agree with this comment, but I’d say it’s also an illustration of why interpretability has diminishing returns and we really need to also be doing “positive alignment.” If you just define some bad behaviors and ablate neurons associated with those bad behaviors (or do other things like filter the AI’s output), this can make your AI safer but with ~exponentially diminishing returns on the selection pressure you apply.
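One hedged way to picture the "~exponentially diminishing returns" point (my framing, not necessarily the commenter's intended model): treat selection against bad behaviors as keeping the best of exponentially many candidate behaviors drawn from a fixed quality distribution. The quality of the best candidate grows only slowly in the number of selection bits, so each extra unit of pressure buys less than the last.

```python
import math

# Toy framing: each unit of selection pressure is one bit, i.e. keeping the
# best of 2**bits candidates drawn from a standard-normal quality distribution.
# The expected best of n draws grows only like sqrt(2 * ln(n)), so each extra
# bit of selection buys a smaller improvement than the one before.
for bits in [1, 2, 4, 8, 16, 32]:
    n = 2 ** bits
    approx_best_sigma = math.sqrt(2 * math.log(n))
    print(f"{bits:2d} bits of selection -> best-of-{n} quality ~ {approx_best_sigma:.2f} sigma")
```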
What we’d also like to be doing is defining good behaviors and helping the AI develop novel capabilities to pursue those good behaviors. This is trickier because maybe you can’t just jam the internet at self-supervised learning to do it, so it has more bits that look like the “classic” alignment problem.
I agree that black box alignment research (where we do not look at what the hidden layers are doing) is crucial for AI and AGI safety.
I just personally am more interested in interpretability than in direct alignment, because I think I am currently better at making interpretable machine learning models and interpretability tools, and because I can make my observations rigorous enough that anyone willing to reproduce my experiments or read my proofs will be convinced. This may have more to do with my area of expertise than with any objective judgment about the relative importance of interpretability versus black box alignment.
Can you elaborate on what you mean by ‘exponentially diminishing returns’? I don’t think I fully get that or why that may be the case.
If you start with an AI that makes decisions of middling quality, how well can you get it to make high-quality decisions by ablating neurons associated with bad decisions? This is the central thing I expect to have diminishing returns (though it’s related to some other uses of interpretability that might also have diminishing returns).
If you take a predictive model of chess games trained on human play, it’s probably not too hard to get it to play near the 90th percentile of the dataset. But it’s not going to play as well as Stockfish, almost no matter what you do. The AI is a bit flexible, especially along dimensions where the training data has prediction-relevant variation, but it’s not arbitrarily flexible, and once you’ve changed the few most important neurons, the remaining neurons will matter progressively less. I expect this to show up for all sorts of properties (e.g. moral quality of decisions), not just chess skill.
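To make the "progressively less important neurons" intuition concrete, here is a toy numerical sketch. The heavy-tailed distribution of per-neuron effects is an assumption chosen to illustrate the saturation, not data from any real chess model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-neuron effect sizes on decision quality.  The heavy-tailed
# draw encodes the assumption that a few neurons matter a lot and the rest
# matter progressively less; it is not a measurement of a real network.
effects = np.sort(rng.pareto(a=1.5, size=1000))[::-1]
cumulative = np.cumsum(effects) / effects.sum()

for k in [1, 10, 100, 1000]:
    print(f"editing the top {k:4d} neurons captures {cumulative[k - 1]:.0%} of the available effect")
```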
My concern is that interpretability may be dangerous, or lead to a higher P(doom), in a different way.
The problem is, if we have a better way of steering LLMs towards a certain set of value systems, how can we guarantee that the “value system” is right? For example, steering LLMs towards a certain value system can easily be abused to mass-generate fake news that is more ideologically consistent and aligned. Steering can also make LLMs omit information that would offer a neutral point of view. This seems to be a different form of “doom” compared with AI taking full control.
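For concreteness, this is roughly the kind of steering being discussed: adding a fixed "value direction" to a model's internal activations. The sketch below uses a toy PyTorch module and a random vector standing in for a direction found with interpretability tools; it illustrates the mechanism, not anyone's actual pipeline.

```python
import torch
import torch.nn as nn

# Stand-in for one residual block of an LLM; a real setup would hook a layer
# of an actual model, but this toy keeps the sketch self-contained.
hidden_dim = 64
block = nn.Linear(hidden_dim, hidden_dim)

# Hypothetical "value direction".  In practice this would be extracted with
# interpretability tools (e.g. by contrasting activations on prompts that do
# and do not express the target values), not sampled at random.
value_direction = torch.randn(hidden_dim)
value_direction = value_direction / value_direction.norm()
steering_strength = 4.0

def steering_hook(module, inputs, output):
    # Nudge the block's activations along the chosen direction; whoever
    # controls this vector controls which "values" the outputs lean toward.
    return output + steering_strength * value_direction

handle = block.register_forward_hook(steering_hook)
steered = block(torch.randn(1, hidden_dim))  # activations now carry the offset
handle.remove()
```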
Kinda, my current mainline doom case is “some AI gets controlled --> powerful people use it to prop themselves up --> world gets worse until AI gets uncontrollably bad --> doom”. What you describe I would call a different yet also-important doom case: a “perpetual low-grade-AI dictatorship where the AI is controlled by humans in a surveillance state”.
Another way interpretability work can be harmful: some means by which advanced AIs could do harm require them to be credible. For example, in unboxing scenarios where a human has something an AI wants (like access to the internet), the AI might be much more persuasive if the gatekeeper can verify the AI’s statements using interpretability tools. Otherwise, the gatekeeper might be inclined to dismiss anything the AI says as plausibly fabricated. (And interpretability tools provided by the AI might be more suspect than those developed beforehand.)
It’s unclear to me whether interpretability tools have much of a chance of becoming good enough to detect deception in highly capable AIs. And there are promising uses of low-capability-only interpretability—like detecting early gradient hacking attempts, or designing an aligned low-capability AI that we are confident will scale well. But to the extent that detecting deception in advanced AIs is one of the main upsides of interpretability work people have in mind (or if people do think that interpretability tools are likely to scale to highly capable agents by default), the downsides of those systems being credible will be important to consider as well.
I am a bit confused by your operationalization of “Dangerous”. On one hand
I posit that interpretability work is “dangerous” when it enhances the overall capabilities of an AI system, without making that system more aligned with human goals
is a definition I broadly agree with, especially since you want it to track the alignment-capabilities trade-off (see also this post). However, your examples suggest a more deontological approach:
This suggests a few concrete rules-of-thumb, which a researcher can apply to their interpretability project P: …
If P makes it easier/more efficient to train powerful AI models, then P is dangerous.
Do you buy the alignment-capabilities trade-off model, or are you trying to establish principles for interpretability research? (or if both, please clarify what definition we’re using here)
Good point. My basic idea is something like "most interp work makes it more efficient to train/use increasingly-powerful/dangerous models". So I think the two uses of "dangerous" you quote here both fit with this idea.