Best I’ve got is to go dark once it feels like you’re really getting somewhere, and only work with people under NDAs (honour-based or actually legally binding) from there on out. At least a facsimile of proper security: central whitelists of orgs and people considered trustworthy, and central standard communication protocols with security levels, set up to facilitate communication between alignment researchers. Maybe a forum system that isn’t on the public net. Live with the decrease in research efficiency this brings, and try to make it to the finish line in time anyway.
If some org or group made it their job to start developing and trial-running these measures right now, I think that’d be great. I think even today, it might enable some researchers to collaborate more.
Very open to alternate solutions that don’t cost so much efficiency if anyone can think of any, but I’ve got squat.
Ideally, one wants “negative alignment tax”, so that aligned systems progress faster than the unaligned ones.
And if alignment work does lead to a capability boost, one might get exactly that. But then the people pursuing such work might suddenly find themselves grappling with all the responsibilities of being a capabilities leader. If they are focused on alignment, this presumably reduces the overall risk, but I don’t think we’ll ever be in a situation of zero risk.
I think we need to start talking about this, both in terms of policies for sharing or not sharing information, and in terms of how we expect an alignment-focused organization to handle the risks if it finds itself in a position where it might be ready to actually create a truly powerful AI well above the state of the art.
I hope people will ponder this.