While I agree with you that setting the context as Safety narrows down the requirements space for interpretability, I think there could be more to it than just excluding the interpretability needs of non-technical users from the picture. The inspections that technicians would want in order to be reassured about a model's safety are probably around its motivations (e.g., is the system goal-directed, is the system producing mesa-optimisers). However, it is still unclear to me how this relates to the other interpretability desiderata you present in the post.
Plus, one could also imagine safety-relevant scenarios where it may be necessary or useful for non-technical users to be able to interpret the model. For instance, if the model has been deployed and is adaptive, and we somehow cannot do this automatically, we would probably want users to be able to inspect whether the system has made a decision for the wrong reason.
Regarding how interpretability can help with addressing motivation issues, I think Chris Olah’s views present situations where interpretability can potentially sidestep some of those issues. One such example is that if we use interpretability to aid in model design, we might have confidence that our system isn’t a mesa-optimizer, and we’ve done this without explicitly asking questions about “what our model desires”.
I agree that this is far from the whole picture. The scenario you describe is an example where we’d want to make interpretability more accessible to more end-users. There is definitely more work to be done to bridge “normal” human explanations with what we can get from our analysis.
I’ve spent more of my time thinking about the technical sub-areas, so I’m focused on situations where innovations there can be useful. I don’t mean to say that this is the only place where I think progress is useful.
That seems more than reasonable to me, given the current state of AI development.
Thanks for sharing your reflections on my comment.