Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

lberglund 8 Aug 2022 23:28 UTC
9 points
The way I see it, having a lower-level understanding of things lets you build abstractions of their behavior, which you can then use to understand them at a higher level. For example, if you understand how transistors work at a low level, you can abstract away their behavior and examine more efficiently how they are wired together to form memory and processors. This is why I believe a circuits-style approach is the most promising one we have for interpretability.
Do you agree that a lower-level understanding is often the best route to a higher-level understanding, particularly for neural network interpretability, or would you advocate a different approach?

lberglund comments on Interpretability/Tool-ness/Alignment/Corrigibility are not Composable