All the critiques focus on MI not being effective enough at its ultimate purposes: primarily interpretability and, secondarily (I assume), finding adversaries, and perhaps something else?
Did you seriously think through whether interpretability, finding adversaries, or specific aspects or kinds of either could be net negative for safety overall? For example, as contemplated in "AGI-Automated Interpretability is Suicide", "AI interpretability could be harmful?", and "Why and When Interpretability Work is Dangerous". That said, I don't think any of the authors of those three posts is an expert in interpretability or adversaries, so it would be really interesting to see your thinking on this topic.