Thanks for the in-depth post on the topic. Your last paragraph on Utility is thought-provoking, to say the least. I have seen a lot of work that claims to make models interpretable (and, in fact, does so), yet it left me with an itch I could not fully verbalise. I think your point on Utility puts a finger on it: most of these works were technically interpreting the model, but the result was not actually useful to the user.
From this, we can also partially explain the current difficulties around the call to “find better ways to formalize what we mean by interpretability”. If acceptable interpretability depends on its usefulness, then it becomes context-dependent, which blows up the complexity of any attempt to formalize it.
Hence, interpretability seems to be an interdisciplinary problem, one that requires having the right culture and principles in order to adopt the right interpretability tools for the right problem. This seems to be confirmed by the many contributions from the social sciences that you shared on the topic.
What do you think of this perspective?
I think that the general form of the problem is context-dependent, as you describe. Useful explanations do seem to depend on the model, task, and risks involved.
However, from an AI safety perspective, we’re probably only considering a restricted set of interpretability approaches, which might make the problem easier. In the safety context, we can probably be less concerned with interpretability that is useful for laypeople, and focus on interpretability that is useful for the people doing the technical work.
To that end, I think that “just” being careful about what the interpretability analysis means can help, like how good statisticians can avoid misuse of statistical testing, even though many practitioners get it wrong.
I think it’s still an open question, though, what even this sort of “only useful for people who know what they’re doing” interpretability analysis would be. Existing approaches still have many issues.
While I agree with you that setting the context as Safety narrows down the requirements space for interpretability, I think there could be more to it than just excluding interpretability for non-technical users from the picture. The inspections that technicians would want, in order to be reassured about the model’s safety, probably concern its motivations (e.g., is the system goal-directed, is the system producing mesa-optimisers). However, it is still unclear to me how this relates to the other interpretability desiderata you present in the post.
Plus, one could also imagine safety-relevant scenarios where it may be necessary or useful for non-technical users to be able to interpret the model. For instance, if the model has been deployed, is adaptive, and we somehow cannot inspect it automatically, we would probably want users to be able to check whether the system has made a decision for the wrong reason.
Regarding how interpretability can help with addressing motivation issues, I think Chris Olah’s views present situations where interpretability can potentially sidestep some of those issues. One such example is that if we use interpretability to aid in model design, we might have confidence that our system isn’t a mesa-optimizer, and we’ve done this without explicitly asking questions about “what our model desires”.
I agree that this is far from the whole picture. The scenario you describe is an example where we’d want to make interpretability more accessible to more end-users. There is definitely more work to be done to bridge “normal” human explanations with what we can get from our analysis.
I’ve spent more of my time thinking about the technical sub-areas, so I’m focused on situations where innovations there can be useful. I don’t mean to say that this is the only place where I think progress is useful.
That seems more than reasonable to me, given the current state of AI development.
Thanks for sharing your reflections on my comment.