I think the main reasons to work on mechanistic interp do not look like “we can literally understand all the cognition behind a powerful AI”, but instead “we can bound the behavior of the AI”
I assume “bound the behavior” means provide a worst-case guarantee. But if we don’t understand all the cognition, how can we provide such a guarantee? How do we know that the part of the AI we don’t understand wouldn’t ruin our guarantee?
we can help other, weaker AIs understand the powerful AI
My understanding of interpretability is that humans understand what the AI is doing. Weaker AIs understanding the powerful AI doesn’t feel like a solution to interpretability. Instead it feels like a solution to amplification that’s still uninterpretable by humans.
This somewhat feels like semantics to me; it still seems like a win condition! I don’t personally care whether interpretability helps via humans directly understanding the systems themselves, or via us partially understanding them and using weaker AI systems to fully understand them, so long as the result is good enough to make aligned systems.
I also think that interpretability lies on a spectrum rather than being a binary.