My understanding of interpretability is that humans understand what the AI is doing. Weaker AIs understanding the powerful AI doesn’t feel like a solution to interpretability. Instead it feels like a form of amplification whose output is ultimately still uninterpretable by humans.
This somewhat feels like semantics to me; it still seems like a win condition! I don’t personally care whether interpretability helps via humans directly understanding the systems themselves, or via us partially understanding them ourselves and using weaker AI systems to fully understand them, so long as the result is good enough to make aligned systems.
I also think that interpretability lies on a spectrum rather than being a binary.