You use the word robustness a lot, but interpretability is related to the opposite of robustness.
When your tree detector says a tree is a tree, nobody will complain. The importance of interpretability is in understanding why it might be wrong, in either direction—either before or after the fact.
If your hand-written tree detector relies on identifying green pixels, then you can say up front that it won’t work in deciduous forests in autumn and winter. That’s not robust, but it’s interpretable. You can analyze causality from inputs to outputs (though this gets progressively more difficult), and you may also be able to say with confidence that changing a single pixel will have limited or no effect.
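To make that concrete, here is a minimal sketch of what such a hand-written detector might look like; the channel comparison and the 0.3 threshold are illustrative assumptions, not anyone’s actual system.

```python
import numpy as np

# Hand-written "green pixel" tree detector (illustrative sketch).
def looks_like_tree(image_rgb: np.ndarray, green_fraction: float = 0.3) -> bool:
    r, g, b = image_rgb[..., 0], image_rgb[..., 1], image_rgb[..., 2]
    green_pixels = (g > r) & (g > b)          # per-pixel "is this pixel greenish?"
    # Each pixel contributes at most 1/N to the score, so the effect of
    # changing a single pixel is bounded by construction -- a property you
    # can state up front, even though the detector fails in autumn.
    return bool(green_pixels.mean() > green_fraction)
```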
The extreme of this is safety systems, where the effect of the input state (both expected and unexpected, like broken sensors) on the output is supposed to be bounded and well characterized.
I can offer a very minimal example from my own experience training a relatively simple neural net to approximate a physical system. The system was simple enough that a purely linear model was a decent starting point. For the linear model, and for a variety of simple non-linear models I could have used, it would be trivial to guarantee that, for example, the behavior of the approximation would be smooth and monotonic in areas where I didn’t have data. A sufficiently complex network, on the other hand, needed considerable effort to guarantee such behavior.
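As a rough sketch of that contrast, with a hypothetical 1-D system: for a linear fit, monotonicity follows from the sign of one coefficient, while for a trained network (the commented-out `net` below is hypothetical) you are left constraining the architecture or spot-checking on a grid.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = 2.0 * x + 0.1 * rng.normal(size=200)     # noisy samples of a monotonic response

# Linear model: monotonicity everywhere, including regions with no data,
# follows from the sign of a single coefficient.
slope, intercept = np.polyfit(x, y, deg=1)
print("monotone increasing:", slope > 0)

# A neural network offers no such one-line argument: you either constrain
# the architecture (e.g. non-negative weights with monotone activations)
# or check the fitted model empirically on a grid, which only covers the
# points you happen to sample.
grid = np.linspace(-0.5, 1.5, 1000)          # deliberately extends past the data
# preds = net(grid[:, None])                 # hypothetical trained network
# print("empirically monotone:", np.all(np.diff(preds) >= 0))
```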
You’re correct that the labels in a “simple” decision tree are hiding a lot of complexity, but for classical systems they usually come from simple labeling methods themselves. “Season” may come from a lookup in a calendar. “Sprinkler” may come from a landscaping schedule or a sensor attached to the valve controlling the sprinkler.
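For concreteness, a sketch of what such labeling code tends to look like; the month boundaries and the valve threshold are illustrative assumptions.

```python
from datetime import date

def season(today: date) -> str:
    # Label comes from a calendar lookup, nothing learned.
    return {12: "winter", 1: "winter", 2: "winter",
            3: "spring", 4: "spring", 5: "spring",
            6: "summer", 7: "summer", 8: "summer"}.get(today.month, "autumn")

def sprinkler_on(valve_sensor_reading: float, threshold: float = 0.5) -> bool:
    # Label is a one-line threshold on a dedicated valve sensor,
    # not a learned function of the whole scene.
    return valve_sensor_reading > threshold

print(season(date(2023, 10, 5)))   # autumn
print(sprinkler_on(0.8))           # True
```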
Deep learning is encouraged to break this sort of compartmentalization in the interest of greater accuracy. The sprinkler label may override the valve signal, which promises to make the system robust to a bad sensor, but how it chooses to do so based on training data may be hard to discern. The opposite may be true as well: the hand-written system may anticipate failure modes of the sensor for which there is no training data.
If you look at any well-written networking code, for example, it will be handling network errors that may be extremely unlikely. It may even respond to error codes that cannot currently happen (for example, a code specified by the OS but not yet implemented). When the vendor announces that the new feature is supported, you can look at the code and verify that the behavior will still be correct.
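A minimal sketch of that pattern; the host/port protocol and the particular error codes are illustrative assumptions, not any specific codebase.

```python
import errno
import socket

def fetch(host: str, port: int) -> bytes:
    try:
        with socket.create_connection((host, port), timeout=5.0) as sock:
            sock.sendall(b"ping\n")
            return sock.recv(4096)
    except socket.timeout:
        return b""                       # common, expected failure
    except OSError as exc:
        if exc.errno == errno.ECONNREFUSED:
            return b""                   # expected today
        if exc.errno == errno.EPROTONOSUPPORT:
            # Not produced by the current stack, but if the platform ever
            # starts reporting it, the behavior here is already pinned down
            # and can be verified just by reading this branch.
            return b""
        raise                            # anything truly unexpected still surfaces
```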
To summarize—interpretability is about debugging and verification. Those are stepping stones towards robustness, but robustness is a much higher bar.
If your hand-written tree detector relies on identifying green pixels, then you can say up front that it won’t work in deciduous forests in autumn and winter. That’s not robust, but it’s interpretable.
I would call that “not interpretable”, because the interpretation of that detector as a tree-detector is wrong. If the internal-thing does not robustly track the external-thing which it supposedly represents, then I’d call that “not interpretable” (or at least not interpretable as a representation of the external-thing); if we try to interpret it as representing the external-thing then we will shoot ourselves in the foot.
Obviously, it’s an exaggerated failure mode. But all systems have failure modes, and are meant to be used under some range of inputs. A more realistic requirement may be night versus day images. A tree detector that only works in daylight is perfectly usable.
The capabilities and limitations of a deep learned network are partly hidden in the input data. The autumn example is an exaggeration, but there may very well be species of tree that are not well represented in your inputs. How can you tell how well they will be recognized? And, if a particular sequoia is deemed not a tree—can you tell why?
I feel like there is a valid point here: one aspect of interpretability is “Can the model report low confidence (or no confidence) vs. high confidence appropriately?”
My intuition is that this failure mode is a bit more likely-by-default in a deep neural net than in a hand-crafted logic model. That doesn’t seem like an insurmountable challenge, but certainly something we should keep in mind.
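One minimal way to surface that explicitly is to let the model abstain below a confidence threshold; the softmax-threshold rule and the 0.9 cutoff below are illustrative assumptions, and softmax confidence is itself often poorly calibrated off-distribution, so this is a starting point rather than a fix.

```python
import numpy as np

def classify_or_abstain(logits: np.ndarray, threshold: float = 0.9):
    # Convert logits to softmax probabilities (shifted for numerical stability).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if probs.max() < threshold:
        return None                   # "I don't know" rather than a guess
    return int(probs.argmax())

print(classify_or_abstain(np.array([2.0, 1.9])))   # close call -> None
print(classify_or_abstain(np.array([5.0, 0.0])))   # confident -> 0
```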
Overall, this article and the discussion in the comments seem to boil down to “yeah, deep neural nets are (complexity held constant) probably not a lot harder (just somewhat harder) to interpret than big Bayes net blobs.”
I think this is probably true, but it is missing a critical point: the expansion of compute hardware and the improvement of machine learning algorithms have allowed us to generate deep neural nets with the ability to make useful decisions in the world, but also with a HUGE amount of complexity.
The value of what John Wentworth is saying here, in my eyes, is that we wouldn’t have solved the interpretability problem even if we could magically transform our deep neural net into a nicely labelled billion-node Bayes net, even if every node had an accompanying plain-text description a few paragraphs long which allowed us to pretty closely translate the values of that particular node into real-world observations (i.e. it was well symbol-grounded). We’d still be overwhelmed by the complexity. Would it be ‘more’ interpretable? I’d say yes, so I’d disagree with the strong claim of ‘exactly as interpretable with complexity held constant’. Would it be enough more interpretable that it would make sense to blindly trust this enormous flowchart with critical decisions involving the fate of humanity? I’d say no.
So there are several different valid aspects of interpretability being discussed across the comments here:
Alex Khripin’s discussion of robustness (perhaps paraphrasable as ‘trustworthy outputs over all possible inputs, no matter how far out-of-training-distribution’?)
Ash Gray’s discussion of symbol grounding. I think it’s valid to say that there is an implication that a hand-crafted or well-generated Bayes net will be reasonably well symbol-grounded; if it weren’t, I’d say it was poor quality. A deep neural net doesn’t give you this by default, but it isn’t implausible to generate that symbol grounding. That is additional work that needs to be done, though, and an additional potential point of failure. So, addressable? Probably yes, but...
DragonGod and John Wentworth discussing “complexity held the same, is the Bayes net / decision flowchart a bit more interpretable?” I’d say probably yes, but...
Stephen Byrnes’ point that, challenge level of the task held constant, a slightly less complex (fewer parameters/nodes) Bayes net could probably accomplish an equivalent quality of result. I’d say probably yes, but...
And the big ‘but’ here is that mind-bogglingly huge amount of complexity: the remaining interpretability gap between models simple enough to wrap our heads around and SOTA models well beyond our comprehension threshold. I don’t think we are close to understanding these very large models well enough to trust them on s-risk (much less x-risk) level issues even on-distribution, much less to declare them ‘robust’ enough for off-distribution use. That is a significant problem, because the big problems humanity faces tend to be inherently off-distribution: they are about planning actions for the future, and the future is at least potentially off-distribution.
I think if we had 1000 abstract units of ‘interpretability gap’ to close before we were safe to proceed with using big models for critical decisions, my guess is that transforming the deep neural net into a fully labelled, well symbol-grounded, slightly (10%? 20%?) less complex, slightly more interpretable Bayes net would get us something like 1–5 units closer. The ‘hard assertion’ of John Wentworth’s original article (which, based on his responses to comments, I don’t think is what he intends) would say 0 units closer. The soft assertion, which I think John Wentworth would endorse and which I would agree with, is something more like ‘that change alone would make only a trivial difference, even if implemented perfectly’.
Addendum: I do believe that there are potentially excellent synergies between various strategies. While I think the convert-nn-to-labelled-bayes-net strategy might be worth just 5/1000 on its own, it might combine multiplicatively with several other strategies, each worth a similar amount alone. So if you do have an idea for how to accomplish this conversion strategy, please don’t let this discussion deter you from posting that.
This is a really good summary, thank you.