I want to draw a distinction between two things:
What I call “ersatz interpretability” is when the human figures out some “clues” about what the trained model is doing, like “right now it’s a safe bet that the AI thinks there’s a curve in the picture” or “right now the AI is probably thinking a thought that it believes is high-reward”.
What I call “real interpretability” is the ultimate goal of really understanding what a trained model is doing, why, and how, from top to bottom.
I’m optimistic that we’ll get more than zero out of ersatz interpretability, even as we scale to superintelligent AGI. As you point out, in model-based RL, just looking at the value function gives some nice information about how the AI is assessing its own prospects. That’s trivial, but useful: I seem to recall that in the movie about AlphaGo, the DeepMind engineers were watching AlphaGo’s value function to see whether it expected to win or lose.
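To make that concrete, here’s a minimal sketch of the “just watch the value function” idea, assuming a model-based RL agent that exposes a value head. The `agent` and `env` APIs below are illustrative placeholders, not any particular library:

```python
import torch

# Hypothetical agent with a policy and a value head, as in many model-based RL setups.
# `agent.policy(obs)` and `agent.value(obs)` are illustrative names, not a real API.
def play_and_watch_value(env, agent, max_steps=500):
    """Roll out one episode while logging the agent's own estimate of its prospects."""
    obs = env.reset()
    for t in range(max_steps):
        with torch.no_grad():
            action = agent.policy(obs)
            value_estimate = agent.value(obs)  # "does the AI expect to win?"
        print(f"step {t}: predicted return = {value_estimate.item():.3f}")
        obs, reward, done, info = env.step(action)
        if done:
            break
```

The human operator gets a running readout of “how well does the AI think things are going?” without needing to understand anything else about its internals.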
Or if we want a strong but imperfect hint that our fancy AI is thinking about blueberries, we can just train a classifier to look at the whole array of network activations, where the supervisory signal is a simple ConvNet blueberry classifier with the same field-of-view as the fancy AI. It won’t be perfect—maybe the AI is imagining blueberries without looking at them or whatever. But it won’t be totally useless either. In fact I’m an advocate for literally having one such classifier for every word in the dictionary, at minimum. (See here, here (search for the words “more dakka”), and more discussion in a forthcoming post.)
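Here’s a rough sketch of what one such probe could look like in PyTorch. Everything here is an illustrative assumption: `fancy_model.get_activations` stands in for however the big model’s internal activations get exposed, and `blueberry_convnet` is the simple pretrained classifier providing the supervisory signal:

```python
import torch
import torch.nn as nn

class ActivationProbe(nn.Module):
    """Linear probe: predicts "blueberry present?" from the big model's activations."""
    def __init__(self, activation_dim: int):
        super().__init__()
        self.linear = nn.Linear(activation_dim, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        return self.linear(activations)  # logits

def train_probe(probe, fancy_model, blueberry_convnet, dataloader, epochs=1):
    # `fancy_model.get_activations` and `blueberry_convnet` are hypothetical stand-ins;
    # both are frozen, so only the small probe gets trained.
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for images in dataloader:
            with torch.no_grad():
                acts = fancy_model.get_activations(images)    # (batch, activation_dim)
                labels = blueberry_convnet(images).sigmoid()  # soft supervisory signal, (batch, 1)
            logits = probe(acts)
            loss = loss_fn(logits.squeeze(-1), labels.squeeze(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```

The “one classifier per dictionary word” version is just this repeated with a different supervisory classifier per word (or a single multi-label head); each probe is cheap to train because the big model stays frozen.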
At the same time, I’m somewhat pessimistic about “real interpretability”, or at least I’m concerned that it will hit a wall at some point. There are concepts in Ed Witten’s head when he thinks about string theory that I’m just not going to be able to really understand, beyond “this concept has something vaguely to do with this other concept which has something vaguely to do with axions” or whatever.
I think real and ersatz interpretability are different points on a spectrum, corresponding to different levels of completeness. Each model has a huge collection of factors that decide its behavior. Better explanations abstract away more of those factors in a way humans can understand and use to accurately predict model behavior. Worse explanations cover fewer factors and are less able to reliably predict model behavior.
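One way to make “level of completeness” concrete (a toy illustration of my own, not anything from the discussion above) is to treat an explanation as a simpler surrogate model and measure how often it reproduces the original model’s behavior:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def explanation_fidelity(model_predict, X, max_depth=3):
    """Fit a small, human-readable surrogate to the model's own outputs and report
    how often it agrees with the model. Higher agreement ~ a more complete explanation."""
    y_model = model_predict(X)  # the black-box model's labels on inputs X
    surrogate = DecisionTreeClassifier(max_depth=max_depth)
    surrogate.fit(X, y_model)
    return accuracy_score(y_model, surrogate.predict(X))
```

A shallow tree that agrees with the model 60% of the time is an ersatz explanation; as the surrogate covers more of the factors driving the model’s behavior, you slide toward the “real” end of the spectrum.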
I’m relatively optimistic about how far we can get with real interpretability. Much of that comes from thinking that we can get pretty far with approaches we currently consider extreme. E.g., I think we can do something like knowledge distillation from AIs to humans by feeding AI internal activations to human brains through channels with wider bandwidth than visual senses, i.e., either through the peripheral nervous system or (more riskily) directly via brain-machine interface. So if you have an unknowable concept in an AI, you can target the knowledge distillation process at that concept and learn appropriate intuitions for representing and integrating it directly from the AI’s own representations.
I intend to explore ideas in this space further in a future post, probably titled “The case for optimism about radical interpretability”.
Am I correct in thinking that ‘ersatz’ and ‘real’ interpretability might differ in more than just degree of interpretability? Ersatz interpretability is largely about explaining the typical case, whereas ‘real’ interpretability gives good reasoning even in the worst case. Interpretability might be hard to achieve in worst-case scenarios where some atypical wiring leads to wrong decisions.
Furthermore, I suspect this confuses transparency with interpretability. Even if we understand what each and every neuron does (radical transparency), the model might still not be interpretable if what those neurons do seems like gibberish.
If this seems correct so far, I elaborate on these points here.
Newbie question: What’s the difference between transparency and interpretability? Follow-up question: Does everyone agree with that answer or is it not standardized?
They’re mostly interchangeable, sorry. Here I might’ve misused the words to try to tease out a distinction: simply understanding how a given model works isn’t really insightful if the patterns themselves aren’t understandable.
I think these patterns that seem nonsensical to humans might be a significant fraction of the patterns learned by deep networks. I was trying to understand the radical optimism, in contrast to my pessimism given this. The crux is that since we don’t know what these patterns are or what they represent, then even if we figure out which neurons detect them and which tasks they contribute most to, we might not be able to do the downstream tasks we require transparency for, like diagnosing possible issues and providing solutions.
I think you’re pointing to a special case of a more general pattern.
Just like there’s a general factor of “athletic ability” which can be subdivided into many correlated components, I think “interpretability” can be split into many correlated components. Some of those components correspond to greater reliability in worst-case scenarios. Others might correspond to, e.g., predicting which architectural modifications would improve performance on typical inputs.
Worst-case reliability is probably the most important component of interpretability, but I’m not sure it makes sense to call it “real interpretability”. It’s probably better to just call it “worst-case” interpretability.