Hmm, I actually don’t think this is uncontroversial if by ‘interpretability’ you mean mechanistic interpretability. I think there’s a pretty plausible argument that doing anything other than running your AI (and training it) will end up being irrelevant. And this argument could extend to thinking that the expected value of working on (mechanistic) interpretability is considerably lower than that of working in other domains.
Conditional on this argument being correct, my response is that I would start advocating for simply slowing down or even stopping AI, because this is a world where we have to get very lucky to succeed at aligning AI.
I guess I’m considerably more optimistic about avoiding AI takeover without humans understanding what the models are thinking. (Or possibly you’re more optimistic about slowing down AI.)
Basically this. I am a lot more pessimistic about black box alignment than I am about white box alignment.
FWIW, white box alignment doesn’t imply humans understand what the models are thinking. There are other ways to leverage the fact that we have access to the internals.
I was using "white box alignment" as a synonym for alignment with interpretability, as opposed to alignment without it.
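(For concreteness, a minimal sketch of what "leveraging access to the internals without humans understanding what the model is thinking" could look like in practice: probe-based monitoring. Everything below is illustrative and synthetic; the dimensions, the planted "concept" direction, and the data are assumptions for the sake of the example, not anyone's actual proposal.)

```python
# Illustrative sketch: train a linear probe on hidden activations to flag a
# property of interest, then use the probe as a runtime monitor. No human ever
# has to interpret what the underlying features "mean"; we only exploit the
# fact that we can read the internals at all.
# The "activations" here are synthetic: random vectors with a planted direction
# standing in for a real model's residual stream.

import numpy as np

rng = np.random.default_rng(0)
d = 64           # activation dimensionality (stand-in for a model's hidden size)
n = 2000         # number of labeled examples

# Synthetic data: a hidden "concept" direction is planted in the positive examples.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
activations = rng.normal(size=(n, d)) + np.outer(labels, direction) * 2.0

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(d), 0.0
lr = 0.1
for _ in range(500):
    logits = activations @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    grad_logits = (probs - labels) / n     # gradient of mean cross-entropy w.r.t. logits
    w -= lr * (activations.T @ grad_logits)
    b -= lr * grad_logits.sum()

# Use the probe as a monitor: flag any activation whose score is high, even
# though no human has interpreted the circuit that produced it.
test_act = rng.normal(size=d) + direction * 2.0
score = 1.0 / (1.0 + np.exp(-(test_act @ w + b)))
print(f"probe score on a planted example: {score:.3f}")
```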