I think the main reasons to work on mechanistic interp do not look like “we can literally understand all the cognition behind a powerful AI”, but instead “we can bound the behavior of the AI”
I assume “bound the behavior” means provide a worst-case guarantee. But if we don’t understand all the cognition, how can we provide such a guarantee? How do we know that the part of the AI we don’t understand wouldn’t ruin our guarantee?
we can help other, weaker AIs understand the powerful AI
My understanding of interpretability is that humans understand what the AI is doing. Weaker AIs understanding the powerful AI doesn’t feel like a solution to interpretability. Instead it feels like a solution to amplification that’s still uninterpretable by humans.
This somewhat feels like semantics to me; it still seems like a win condition! I don’t personally care whether interpretability helps via humans directly understanding the systems themselves, or via us partially understanding them and using weaker AI systems to fully understand them, so long as the result is good enough to make aligned systems.
I also think that interpretability lies on a spectrum rather than being a binary.