I think both of these questions are too general to have a useful debate on. 2) is essentially a forecasting question, and 1) also relies on forecasting whether future AI systems will be similar in kind to today's. It's unclear whether current mechanistic interpretability efforts will scale to future systems. Even if they do not scale, it's unclear whether the best research direction now is general research rather than fast-feedback-loop work on specific systems.
It's worth noting that academia and the alignment community are generally unexcited about naive applications of saliency maps; see the video and https://arxiv.org/abs/1810.03292
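For concreteness, here is a minimal sketch of the kind of "naive" gradient-based saliency map that the linked paper (Adebayo et al., 2018) cautions against; the tiny model and random input below are placeholders, not anything from the discussion, and any differentiable classifier would do:

```python
import torch
import torch.nn as nn

# Stand-in for a real image classifier (placeholder architecture).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 10),
)
model.eval()

# Dummy input image; requires_grad so we can take gradients w.r.t. pixels.
image = torch.rand(1, 3, 32, 32, requires_grad=True)

logits = model(image)
top_class = logits.argmax(dim=1).item()

# "Naive" saliency: gradient of the top logit with respect to the input,
# with the absolute value taken and colour channels collapsed.
logits[0, top_class].backward()
saliency = image.grad.abs().max(dim=1).values  # shape (1, 32, 32)

print(saliency.shape)
```

The sanity-check critique is roughly that heatmaps like this can look almost unchanged even when the model's weights are randomized, so they may reflect the input more than the model's reasoning.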