Mechanistic Interpretability likely accelerates AI capabilities development, possibly significantly.
My uninformed guess is that much of the Mechanistic Interpretability community (Olah, Nanda, Anthropic, etc.) should, on the margin, be less public about its work. But I'm not certain about this and haven't put much thought into it.
Does anyone know what their disclosure policies look like?