Epistemic learned helplessness: Idk man, do we even need a theory of impact? In what world is ‘actually understanding how our black box systems work’ not helpful?
The real question is not whether (mechanistic) interpretability is helpful, but whether it could also be “harmful”, i.e., whether it could speed up capabilities without delivering commensurate or greater improvements in safety (Quintin Pope also discusses this risk in this comment), or create a “foom overhang” as described in “AGI-Automated Interpretability is Suicide”. Good interpretability also creates an infosec/infohazard attack vector, as I described here.
Thus, the “theory of impact” for interpretability should not just list its potential benefits, but also explain why these benefits are expected to outweigh the potential harms, timeline shortening, and new risks.