Ideally, one wants a “negative alignment tax”, so that aligned systems progress faster than unaligned ones.
And if alignment work does lead to a capability boost, one might get exactly that. But then the people pursuing such work might suddenly find themselves grappling with all the responsibilities of being a capabilities leader. Their focus on alignment presumably reduces the overall risk, but I don’t think we’ll ever end up in a situation of zero risk.
I think we need to start talking about this, both in terms of policies for sharing or not sharing information, and in terms of how we expect an alignment-focused organization to handle the risks if it finds itself in a position where it could actually create a truly powerful AI well above the state of the art.
I hope people will ponder this.