I feel like there is a valid point here about how one aspect of interpretability is “Can the model report low-confidence (or no confidence) vs high-confidence appropriately?”
My intuition is that this failure mode is a bit more likely-by-default in a deep neural net than in a hand-crafted logic model. That doesn’t seem like an insurmountable challenge, but certainly something we should keep in mind.
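To make the "reports confidence appropriately" idea a bit more concrete, here is a minimal sketch of one common way to check it: bin a model's stated confidences and compare each bin's average confidence with how often the model was actually right (often called expected calibration error). This is only an illustration under my own assumptions, not something proposed in the article; the function name and the toy inputs are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], the model's stated confidence per prediction.
    correct:     booleans, whether each prediction turned out to be right.
    Hypothetical helper for illustration only.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes in proportion to how many predictions fall in it.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data (made up): a model that always says 0.9 but is right 1 time in 4,
# versus one that says 0.5 and is right half the time.
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, False])
calibrated = expected_calibration_error([0.5] * 4, [True, False, True, False])
```

A score near zero means stated confidence tracks actual accuracy on the data you checked; it says nothing about inputs far outside that data, which is the harder part of the problem.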
Overall, this article and the discussion in the comments seem to boil down to “yeah, deep neural nets are (complexity held constant) probably not a lot harder (just somewhat harder) to interpret than big Bayes net blobs.”
I think this is probably true, but it misses a critical point: the expansion of compute hardware and the improvement of machine learning algorithms have allowed us to generate deep neural nets that can make useful decisions in the world but that also carry a HUGE amount of complexity.
The value of what John Wentworth is saying here, in my eyes, is that we wouldn’t have solved the interpretability problem even if we could magically transform our deep neural net into a nicely labelled billion-node Bayes net, and even if every node had an accompanying plain-text description a few paragraphs long which allowed us to pretty closely translate the values of that particular node into real-world observations (i.e. it was well symbol-grounded). We’d still be overwhelmed by the complexity. Would it be ‘more’ interpretable? I’d say yes, and so I’d disagree with the strong claim of ‘exactly as interpretable with complexity held constant’. Would it be so much more interpretable that it would make sense to blindly trust this enormous flowchart with critical decisions involving the fate of humanity? I’d say no.
So there are several different valid aspects of interpretability being discussed across the comments here:
Alex Khripin’s discussion of robustness (perhaps paraphrasable as ‘trustworthy outputs over all possible inputs, no matter how far out-of-training-distribution’?)
Ash Gray’s discussion of symbol grounding. I think it’s valid to say that there is an implication that a hand-crafted or well-generated Bayes net will be reasonably well symbol-grounded. If it weren’t, I’d say it was poor quality. A deep neural net doesn’t give you this by default, but it isn’t implausible to generate that symbol grounding. That is additional work that needs to be done, though, and an additional potential point of failure. So, addressable? Probably yes, but...
DragonGod and JohnWentworth discussing “complexity held the same, is the Bayes net / decision flowchart a bit more interpretable?” I’d say probably yes, but...
Stephen Brynes’ point that, with the challenge level of the task held constant, a slightly less complex (fewer parameters/nodes) Bayes net could probably accomplish an equivalent quality of result. I’d say probably yes, but...
And the big ‘but’ here is that mind-bogglingly huge amount of complexity: the remaining interpretability gap between models simple enough to wrap our heads around and those SOTA models well beyond our comprehension threshold. I don’t think we are anywhere close to understanding these very large models well enough to trust them on s-risk (much less x-risk) level issues even on-distribution, much less to declare them ‘robust’ enough for off-distribution use. That is a significant problem, because the big problems humanity faces tend to be inherently off-distribution: they’re about planning actions for the future, and the future is inherently at least potentially off-distribution.
If we had 1000 abstract units of ‘interpretability gap’ to close before we were safe to proceed with using big models for critical decisions, my guess is that transforming the deep neural net into a fully labelled, well symbol-grounded, slightly (10%? 20%?) less complex, slightly more interpretable Bayes net would get us something like 1 to 5 units closer. If the ‘hard assertion’ of John Wentworth’s original article were taken literally (which I don’t think, based on his responses to comments, is what he intends), it would say 0 units closer. The soft assertion, which I think John Wentworth would endorse and which I would agree with, is something more like ‘that change alone would make only a trivial difference, even if implemented perfectly’.
Addendum: I do believe that there are potentially excellent synergies between various strategies. While I think the convert-nn-to-labelled-bayes-net strategy might be worth just 5/1000 on its own, it might combine multiplicatively with several other strategies, each worth a similar amount alone. So if you do have an idea for how to accomplish this conversion strategy, please don’t let this discussion deter you from posting that.
This is a really good summary, thank you.