I think both of these questions are too general to have a useful debate on. 2) is essentially a forecasting question, and 1) also relies on forecasting whether future AI systems will be similar in kind to today's. It's unclear whether current mechanistic interpretability efforts will scale to future systems. Even if they do not scale, it's unclear whether the best research direction now is general research rather than fast-feedback-loop work on specific systems.
It's worth noting that academia and the alignment community are generally unexcited about naive applications of saliency maps; see the video and https://arxiv.org/abs/1810.03292
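For concreteness, here is a minimal sketch of the kind of "naive" gradient-based saliency map that the linked paper (Adebayo et al., 2018) cautions against; the tiny model and random input below are placeholders, not anything from the discussion, and any differentiable classifier would do:

```python
import torch
import torch.nn as nn

# Stand-in for a real image classifier (placeholder architecture).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 10),
)
model.eval()

# Dummy input image; requires_grad so we can take gradients w.r.t. pixels.
image = torch.rand(1, 3, 32, 32, requires_grad=True)

logits = model(image)
top_class = logits.argmax(dim=1).item()

# "Naive" saliency: gradient of the top logit with respect to the input,
# with the absolute value taken and colour channels collapsed.
logits[0, top_class].backward()
saliency = image.grad.abs().max(dim=1).values  # shape (1, 32, 32)

print(saliency.shape)
```

The sanity-check critique is roughly that heatmaps like this can look almost unchanged even when the model's weights are randomized, so they may reflect the input more than the model's reasoning.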