I think real and ersatz interpretability represent different points on a spectrum of completeness. Each model has a huge collection of factors that decide its behavior. Better explanations abstract away more of those factors in a way humans can understand and use to accurately predict model behavior. Worse explanations cover fewer factors and are less able to reliably predict model behavior.
I’m relatively optimistic about how far we can get with real interpretability. Much of that comes from thinking we can get pretty far with approaches we currently consider extreme. E.g., I think we can do something like knowledge distillation from AIs to humans by feeding AI internal activations to human brains through channels with wider bandwidth than the visual senses, i.e., either through the peripheral nervous system or (more riskily) directly via a brain-machine interface. So if you have an unknowable concept in an AI, you can target the knowledge distillation process at that concept and learn appropriate intuitions for representing and integrating it directly from the AI’s own representations.
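(For concreteness, here’s a minimal sketch of what “targeting distillation at a concept” looks like in ordinary machine-learning terms, before any brain-interface speculation. It assumes PyTorch; the tiny models, the choice of hidden layer as the “concept” layer, the MSE losses, and the random inputs are all illustrative placeholders, not anything from the post. The idea is just to add a distillation term that matches the teacher’s internal activation at the layer carrying the concept, rather than matching outputs alone.)

```python
import torch
import torch.nn as nn

# Toy stand-ins for a large "teacher" model and a smaller "student".
teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
teacher.eval()

# Forward hooks capture the hidden activation assumed to carry the "concept".
teacher_acts, student_acts = {}, {}
teacher[1].register_forward_hook(lambda m, i, o: teacher_acts.update(concept=o.detach()))
student[1].register_forward_hook(lambda m, i, o: student_acts.update(concept=o))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()

for step in range(100):
    x = torch.randn(16, 32)              # placeholder inputs
    with torch.no_grad():
        teacher_logits = teacher(x)      # forward pass also fills teacher_acts["concept"]
    student_logits = student(x)          # fills student_acts["concept"]

    # Standard output matching plus a term targeted at the concept's representation.
    loss = mse(student_logits, teacher_logits) + \
           mse(student_acts["concept"], teacher_acts["concept"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The speculative human version would replace the student network and its gradient updates with a human learner receiving the teacher’s concept-layer activations over a high-bandwidth channel.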
I intend to explore ideas in this space further in a future post, probably titled “The case for optimism about radical interpretability”.
Am I correct in thinking that ‘ersatz’ and ‘real’ interpretability might differ in more than just degree of interpretability? Ersatz interpretability is somewhat tied to explaining the typical case, whereas ‘real’ interpretability gives good reasoning even in the worst case. Interpretability might be hard to achieve in worst-case scenarios where some atypical wiring leads to wrong decisions.
Furthermore, I suspect we may be confusing transparency with interpretability. Even if we understand what each and every neuron does (radical transparency), the model might not be interpretable if that understanding reads as gibberish.
If this seems correct so far, I elaborate on these points here.
Newbie question: What’s the difference between transparency and interpretability? Follow-up question: Does everyone agree on that answer, or is it not standardized?
They’re likely interchangeable, sorry. Here I might’ve misused the words to try to tease out the point that simply understanding how a given model works is not really insightful if the patterns themselves are not understandable.
I think these patterns that seem nonsensical to humans might be a significant fraction of the patterns learned by deep networks. I was trying to understand the radical optimism, in contrast to my pessimism given this. The crux is that since we don’t know what these patterns are and what they represent, then even if we figure out which neurons detect them and which tasks they contribute most to, we might not be able to do the downstream tasks we require transparency for, like diagnosing possible issues and providing solutions.
I think you’re pointing to a special case of a more general pattern.
Just like there’s a general factor of “athletic ability” which can be subdivided into many correlated components, I think “interpretability” can be split into many correlated components. Some of those components correspond to greater reliability in worst-case scenarios. Others might correspond to, e.g., predicting which architectural modifications would improve performance on typical inputs.
Worst-case reliability is probably the most important component of interpretability, but I’m not sure it makes sense to call it “real interpretability”. It’s probably better to just call it “worst-case” interpretability.