There is a causal relationship between time on LW and frequency of paragraph breaks :P
Anyhow, I broadly agree with this comment, but I’d say it’s also an illustration of why interpretability has diminishing returns and we really need to also be doing “positive alignment.” If you just define some bad behaviors and ablate neurons associated with those bad behaviors (or do other things like filter the AI’s output), this can make your AI safer but with ~exponentially diminishing returns on the selection pressure you apply.
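To make the ablation idea concrete, here's a minimal sketch in plain NumPy. The tiny MLP, its random weights, and the specific "bad" neuron indices are all hypothetical, purely for illustration; real interpretability work would identify such neurons empirically in a trained model.

```python
import numpy as np

# Hypothetical tiny network: 4 inputs -> 8 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))  # input -> hidden weights
W2 = rng.normal(size=(8, 2))  # hidden -> output weights

def forward(x, ablate=()):
    """Forward pass, zeroing out the hidden neurons listed in `ablate`."""
    h = np.maximum(x @ W1, 0.0)   # ReLU hidden activations
    h = h.copy()
    h[..., list(ablate)] = 0.0    # ablate the designated neurons
    return h @ W2

x = rng.normal(size=(1, 4))
baseline = forward(x)
# Suppose (hypothetically) neurons 2 and 5 were found to track a bad behavior:
ablated = forward(x, ablate=[2, 5])
```

The diminishing-returns point is visible even here: zeroing the few most behavior-relevant neurons changes the output a lot, but each additional ablation tends to matter less, and ablating everything just destroys the computation rather than steering it.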
What we’d also like to be doing is defining good behaviors and helping the AI develop novel capabilities to pursue those good behaviors. This is trickier because maybe you can’t just jam the internet at self-supervised learning to do it, so it has more bits that look like the “classic” alignment problem.
I agree that black box alignment research (where we do not look at what the hidden layers are doing) is crucial for AI and AGI safety.
I just personally am more interested in interpretability than direct alignment because I think I am currently better at making interpretable machine learning models and interpretability tools, and because I can make my observations rigorous enough to convince anyone willing to replicate my experiments or read my proofs. This may have more to do with my area of expertise than with any objective judgment about the importance of interpretability vs black box alignment.
Can you elaborate on what you mean by ‘exponentially diminishing returns’? I don’t think I fully get that or why that may be the case.
If you start with an AI that makes decisions of middling quality, how well can you get it to make high-quality decisions by ablating neurons associated with bad decisions? This is the central thing I expect to have diminishing returns (though it’s related to some other uses of interpretability that might also have diminishing returns).
If you take a predictive model of chess games trained on human play, it’s probably not too hard to get it to play near the 90th percentile of the dataset. But it’s not going to play as well as Stockfish almost no matter what you do. The AI is a bit flexible, especially along axes where the training data has prediction-relevant variation, but it’s not arbitrarily flexible, and once you’ve changed the few most important neurons the other neurons will be progressively less important. I expect this to show up for all sorts of properties (e.g. moral quality of decisions), not just chess skill.