I think that the general form of the problem is context-dependent, as you describe. Useful explanations do seem to depend on the model, task, and risks involved.
However, from an AI safety perspective, we’re probably only considering a restricted set of interpretability approaches, which might make the problem easier. In the safety context, we can probably be less concerned with interpretability that is useful for laypeople, and focus on interpretability that is useful for the people doing the technical work.
To that end, I think that “just” being careful about what an interpretability analysis does and does not tell us can help, much like how good statisticians avoid misusing statistical tests, even though many practitioners get it wrong.
I think it’s still an open question, though, what even this sort of “only useful for people who know what they’re doing” interpretability analysis would be. Existing approaches still have many issues.
While I agree with you that setting the context as Safety narrows down the requirements space for interpretability, I think there could be more to it than just excluding the interpretability needs of non-technical users from the picture. The inspections that technicians would want in order to be reassured about the model’s safety are probably around its motivations (e.g. is the system goal-directed, is the system producing mesa-optimisers). However, it is still unclear to me how this relates to the other interpretability desiderata you present in the post.
Plus, one could also imagine safety-relevant scenarios where it may be necessary or useful for non-technical users to be able to interpret the model. For instance, if the model has been deployed and is adaptive, and we somehow cannot do this inspection automatically, we would probably want users to be able to check whether the system has made a decision for the wrong reason.
Regarding how interpretability can help with addressing motivation issues, I think Chris Olah’s views present situations where interpretability can potentially sidestep some of those issues. One such example is that if we use interpretability to aid in model design, we might have confidence that our system isn’t a mesa-optimizer, and we’ve done this without explicitly asking questions about “what our model desires”.
I agree that this is far from the whole picture. The scenario you describe is an example where we’d want to make interpretability more accessible to more end-users. There is definitely more work to be done to bridge “normal” human explanations with what we can get from our analysis.
I’ve spent more of my time thinking about the technical sub-areas, so I’m focused on situations where innovations there can be useful. I don’t mean to say that this is the only place where I think progress is useful.
That seems more than reasonable to me, given the current state of AI development.
Thanks for sharing your reflections on my comment.