The problem here is that any effective alignment research is very powerful capability research, almost by definition. If one can actually steer or constrain a powerful AI system, this is a very powerful capability by itself and would enable all kinds of capability boosts.
And imagine one wants to study the core problem: how to preserve values and goals through recursive self-improvement and “sharp left turns”. And imagine one would like to study this problem experimentally, not just theoretically. Well, one can probably create a strictly bounded environment for “mini-foom” experiments (drastic changes in a really small, closed world). But all fruitful techniques for recursive self-improvement learned during such experiments would be immediately applicable to reckless recursive self-improvement in the wild.
How should we start addressing this?
Best I’ve got is to go dark once it feels like you’re really getting somewhere, and only work with people under NDAs (honour-based or actually legally binding) from there on out. At the very least, a facsimile of proper security: central whitelists of orgs and people considered trustworthy, and standard communication protocols with security levels set up to facilitate communication between alignment researchers. Maybe a forum system that isn’t on the public net. Live with the decrease in research efficiency this brings, and try to make it to the finish line in time anyway.
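To make the whitelist-plus-security-levels idea a bit more concrete, here is a minimal sketch of what the access check might look like. Everything here (the `SecurityLevel` tiers, `Researcher` fields, `can_share` rule, and the example org names) is made up for illustration; it is not an existing system or standard.

```python
# Hypothetical sketch of a whitelist / security-level check for sharing
# alignment research artifacts. All names and rules are illustrative.
from dataclasses import dataclass
from enum import IntEnum


class SecurityLevel(IntEnum):
    PUBLIC = 0        # fine to post openly
    TRUSTED_ORGS = 1  # whitelisted organisations only
    NEED_TO_KNOW = 2  # named individuals under NDA


@dataclass(frozen=True)
class Researcher:
    name: str
    org: str
    clearance: SecurityLevel
    nda_signed: bool


# Central whitelist of organisations considered trustworthy (placeholder names).
TRUSTED_ORGS = {"Org A", "Org B"}


def can_share(recipient: Researcher, doc_level: SecurityLevel) -> bool:
    """Return True if a document at doc_level may be sent to recipient."""
    if doc_level == SecurityLevel.PUBLIC:
        return True
    if recipient.org not in TRUSTED_ORGS:
        return False
    if doc_level == SecurityLevel.NEED_TO_KNOW and not recipient.nda_signed:
        return False
    return recipient.clearance >= doc_level
```

The point is only that the rules would need to be explicit and centrally maintained, so that researchers don’t each have to make ad-hoc judgment calls about who can see what.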
If some org or group of people made it their job to start developing and trial-running these measures right now, I think that’d be great. Even today, some researchers might be enabled to collaborate more by this.
Very open to alternate solutions that don’t cost so much efficiency if anyone can think of any, but I’ve got squat.
I hope people will ponder this.
Ideally, one wants a “negative alignment tax”, so that aligned systems progress faster than unaligned ones.
And if alignment work does lead to a capability boost, one might get exactly that. But then the people pursuing such work might suddenly find themselves grappling with all the responsibilities of being a capabilities leader. If they are focused on alignment, this presumably reduces the overall risks, but I don’t think we’ll ever end up in a situation of zero risk.
I think we need to start talking about this, both in terms of policies for sharing or not sharing information, and in terms of how we expect an alignment-focused organization to handle the risks if it finds itself in a position where it might be ready to actually create a truly powerful AI well above the state of the art.
One major capabilities hurdle related to interpretability: the difference between manually “opening up” the model to analyze its weights and activations, and being able to literally ask the model why it did certain things.
And it seems like a path to solving that is to have the AI be able to analyze its own workings, which seems like a potential path to recursive self-improvement as well.
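A rough sketch of the contrast between the two approaches, using the Hugging Face transformers API. The choice of “gpt2” is only a stand-in model, and the self-explanation prompt is purely illustrative; nothing here says anything about whether such answers are faithful to the model’s actual computation.

```python
# Contrast: inspecting a model's internals vs. asking it to explain itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

# Approach 1: manually "open up" the model -- pull out internal activations.
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True, output_hidden_states=True)
last_layer_attention = outputs.attentions[-1]  # shape: (batch, heads, seq, seq)
print("attention tensor shape:", last_layer_attention.shape)

# Approach 2: ask the model itself why it behaved the way it did.
question = "Q: The capital of France is Paris. Why did you give that answer?\nA:"
q_inputs = tokenizer(question, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**q_inputs, max_new_tokens=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

The first approach yields raw tensors that a human still has to interpret; the second yields a natural-language answer, which is exactly the capability (reliable self-analysis) that also looks like a step toward recursive self-improvement.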