I agree that black box alignment research (where we do not look at what the hidden layers are doing) is crucial for AI and AGI safety.
I just personally am more interested in interpretability than direct alignment, because I think I am currently better at making interpretable machine learning models and interpretability tools, and because I can make my observations rigorous enough that anyone willing to copy my experiments or read my proofs will be convinced. This may have more to do with my area of expertise than with any objective judgment about the importance of interpretability vs black box alignment.
Can you elaborate on what you mean by ‘exponentially diminishing returns’? I don’t think I fully get that or why that may be the case.
If you start with an AI that makes decisions of middling quality, how well can you get it to make high-quality decisions by ablating neurons associated with bad decisions? This is the central thing I expect to have diminishing returns (though it’s related to some other uses of interpretability that might also have diminishing returns).
If you take a predictive model of chess games trained on human play, it’s probably not too hard to get it to play near the 90th percentile of the dataset. But it’s not going to play as well as Stockfish, almost no matter what you do. The AI is somewhat flexible, especially along dimensions where the training data has prediction-relevant variation, but it’s not arbitrarily flexible, and once you’ve changed the few most important neurons, the remaining neurons will be progressively less important. I expect this to show up for all sorts of properties (e.g. the moral quality of decisions), not just chess skill.
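For concreteness, here’s roughly the kind of intervention I mean by “ablating neurons”: zeroing out a chosen set of hidden units at inference time and seeing how the model’s decisions change. This is just a minimal PyTorch sketch; the toy network, layer choice, and “bad” neuron indices are all made up for illustration, not taken from any real chess model.

```python
import torch
import torch.nn as nn

# A stand-in prediction/policy network (e.g. a move-prediction head).
model = nn.Sequential(
    nn.Linear(64, 256),   # input features -> hidden layer
    nn.ReLU(),
    nn.Linear(256, 64),   # hidden layer -> move logits
)

# Hypothetical: indices of hidden units we believe are associated with bad decisions.
bad_neurons = [3, 17, 42]

def ablate(module, inputs, output):
    # Zero the selected units' activations on every forward pass.
    output = output.clone()
    output[:, bad_neurons] = 0.0
    return output

# Hook the hidden ReLU so its output is modified before reaching the final layer.
hook = model[1].register_forward_hook(ablate)

with torch.no_grad():
    logits = model(torch.randn(1, 64))  # decisions with the neurons ablated

hook.remove()  # restore normal behaviour
```

The claim about diminishing returns is that the first few such ablations can shift behaviour noticeably, but each additional one buys less, because the remaining neurons matter less and the model only represents the range of play present in its training data.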