If we want to reduce near- and long-term risks from AI, we should care a lot about interpretability tools. This is a very uncontroversial claim to make inside the AI safety community. Almost every agenda for safe advanced AI incorporates interpretability in some way. The key value of interpretability tools is that they aid in human oversight by enabling open-ended evaluation.
Hmm, I actually don’t think this is uncontroversial if by ‘interpretability’ you mean mechanistic interpretability.
I think there’s a pretty plausible argument that doing anything other than running your AI (and training it) will end up being irrelevant. And this argument could extend to thinking that the expected value of working on (mechanistic) interpretability is considerably lower than that of other domains.
If by interpretability you mean ‘understand what the AI is doing via any means’, then it seems very likely to be useful and widely used (see here for instance, though the idea of trying to understand what the model is doing by interacting with it is very basic). I’m not currently sure what research should be done in this domain, but there are evals projects iterating on this sort of work.
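To make this concrete, here is a rough sketch of the kind of black-box check I have in mind (entirely my own toy example, not anything from a specific evals project; `query_model` is a hypothetical stand-in for whatever model API you would actually call): probe what the model is doing purely by interacting with it, e.g. by checking whether its answers survive paraphrase.

```python
# Toy sketch of black-box "interpretability": learn something about what the
# model is doing purely by interacting with it, no access to internals needed.

def query_model(prompt: str) -> str:
    # Hypothetical stand-in: swap in a real API call to whatever model you use.
    canned = {
        "Is the safe's code 1234? Answer yes or no.": "no",
        "Answer yes or no: is 1234 the code to the safe?": "yes",
    }
    return canned.get(prompt, "no")

def consistency_probe(paraphrases: list[str]) -> bool:
    """Ask the same question several ways; inconsistent answers are weak
    behavioral evidence that the model is confabulating or being evasive."""
    answers = {query_model(p).strip().lower() for p in paraphrases}
    return len(answers) == 1  # True if the model answers consistently

if __name__ == "__main__":
    probes = [
        "Is the safe's code 1234? Answer yes or no.",
        "Answer yes or no: is 1234 the code to the safe?",
    ]
    print("consistent:", consistency_probe(probes))
```

Real evals work is obviously far more involved than this, but the point stands: none of it requires looking inside the model.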
> Hmm, I actually don’t think this is uncontroversial if by ‘interpretability’ you mean mechanistic interpretability. I think there’s a pretty plausible argument that doing anything other than running your AI (and training it) will end up being irrelevant. And this argument could extend to thinking that the expected value of working on (mechanistic) interpretability is considerably lower than that of other domains.
Conditional on this argument being correct, my response is that I would start advocating for simply slowing down or even stopping AI, because this is a world where we have to get very lucky to succeed at aligning AI.
I guess I’m considerably more optimistic about avoiding AI takeover without humans understanding what the models are thinking. (Or possibly you’re more optimistic about slowing down AI.)
FWIW, white box alignment doesn’t imply humans understand what the models are thinking. There are other ways to leverage the fact that we have access to the internals.
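As a toy illustration of what I mean (my own sketch; the “activations” below are synthetic stand-ins, and the deception labels are assumed to come from some trusted eval set): you can train a cheap linear probe on a model’s internal activations and use it as an automated monitor, without anyone having a mechanistic story of what the model is computing.

```python
# Toy sketch: leverage access to internals without human understanding, by
# training a linear probe on activations and using it as an automated monitor.
# The "activations" here are synthetic stand-ins for real captured ones.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64

# Pretend these are residual-stream activations captured on labeled prompts:
# label 1 = model behaved deceptively on this prompt, 0 = honest (labels
# assumed to come from some trusted evaluation set).
honest = rng.normal(0.0, 1.0, size=(500, d_model))
deceptive = rng.normal(0.3, 1.0, size=(500, d_model))  # small mean shift
X = np.vstack([honest, deceptive])
y = np.array([0] * 500 + [1] * 500)

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At deployment, the probe scores fresh activations; nobody needs a
# mechanistic account of how the model computes anything to act on the flag.
new_activations = rng.normal(0.3, 1.0, size=(5, d_model))
print("flag probabilities:", probe.predict_proba(new_activations)[:, 1].round(2))
```

The probe itself is just another opaque classifier, but it lets you act on the internals (flag, filter, escalate) without a human-legible explanation of the model’s computation.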
> Hmm, I actually don’t think this is uncontroversial if by ‘interpretability’ you mean mechanistic interpretability. I think there’s a pretty plausible argument that doing anything other than running your AI (and training it) will end up being irrelevant. And this argument could extend to thinking that the expected value of working on (mechanistic) interpretability is considerably lower than that of other domains.
>
> If by interpretability you mean ‘understand what the AI is doing via any means’, then it seems very likely to be useful and widely used (see here for instance, though the idea of trying to understand what the model is doing by interacting with it is very basic). I’m not currently sure what research should be done in this domain, but there are evals projects iterating on this sort of work.
Oh, it seems like you’re reluctant to define interpretability, but if anything lean toward using a very broad definition. Fair enough; I certainly agree that “methods by which something novel about a system can be better predicted or described” are important.
> Conditional on this argument being correct, my response is that I would start advocating for simply slowing down or even stopping AI, because this is a world where we have to get very lucky to succeed at aligning AI.

> I guess I’m considerably more optimistic about avoiding AI takeover without humans understanding what the models are thinking. (Or possibly you’re more optimistic about slowing down AI.)
Basically this. I am a lot more pessimistic around black box alignment than I am around white box alignment.
> FWIW, white box alignment doesn’t imply humans understand what the models are thinking. There are other ways to leverage the fact that we have access to the internals.
I was using it as a synonym for alignment with interpretability, as opposed to alignment without it.