This post seems to me to be beating around the bush. There are several different classes of transparency methods evaluated by several different proxy criteria, but this is all sort of tangential to the thing which actually matters: we do not understand what “understanding a model” means, at least not in a sufficiently-robust-and-legible way to confidently put optimization pressure on it.
For transparency via inspection, the problem is that we don’t know what kind of “understanding” is required to rule out bad behavior. We can notice that some low-level features implement a certain kind of filter, and that is definitely a valuable clue to how these things work, but we probably wouldn’t be able to tell if those filters were wired together in a way which implemented a search algorithm.
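To make the gap concrete, here is a minimal sketch of the kind of inspection that is feasible today: pull out the first-layer convolution filters of a pretrained network and flag the ones that look like oriented edge detectors. The model choice and the “looks like an edge detector” heuristic are my own illustrative assumptions, and nothing in the sketch tells you how those filters get composed downstream.

```python
# Hypothetical sketch: inspect first-layer conv filters of a pretrained ResNet.
# The heuristic below (variance along one spatial axis vs. the other) is a
# crude, made-up proxy for "looks like an oriented edge detector".
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
filters = model.conv1.weight.detach()        # shape: (64, 3, 7, 7)

for i, f in enumerate(filters):
    g = f.mean(dim=0)                        # collapse RGB channels -> (7, 7)
    row_var = g.var(dim=1).mean().item()     # variation across columns
    col_var = g.var(dim=0).mean().item()     # variation across rows
    if max(row_var, col_var) > 3 * min(row_var, col_var):
        print(f"filter {i}: plausibly an oriented edge/line detector")
```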
For transparency via training, the problem is that we don’t know what metric measures the “understandability” of the system in a way sufficient to rule out bad behavior. The tree regularization example is a good one—it is not the right metric, and we don’t know what to replace it with. Use a proxy, get goodharted.
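For concreteness, here is a toy version of that kind of proxy (not the actual tree regularization procedure from Wu et al., which uses a differentiable surrogate; the model, task, and metric below are all placeholders): distill a trained network into a decision tree and treat the tree’s size as the “understandability” score. Anything optimized against that single number is free to game it.

```python
# Toy illustration of a tree-based "understandability" proxy. Model, task,
# and metric are made up; the point is only that the proxy is a single
# optimizable number, which is exactly what invites Goodharting.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # toy labeling rule

net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=1000).fit(X, y)

# Proxy metric: how small a decision tree can mimic the net on fresh inputs?
X_probe = rng.normal(size=(2000, 10))
surrogate = DecisionTreeClassifier().fit(X_probe, net.predict(X_probe))
print("surrogate tree node count:", surrogate.tree_.node_count)
# A small surrogate tree *suggests* simple behavior, but a model optimized
# against this number is only guaranteed to make the proxy look good.
```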
For transparency via architecture, the problem is that we don’t know what architectural features allow “understanding” of the kind needed to rule out bad behavior. Does clusterability actually matter? No idea.
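For reference, “clusterability” here is something like the graph-partitioning measure studied by Filan et al. A rough sketch of that kind of measurement, using toy random weights and a simplified within-cluster weight fraction rather than the paper’s normalized-cut statistic, looks like this:

```python
# Rough sketch of a clusterability measurement: treat |weights| as an
# undirected graph over all neurons and check how much edge weight a spectral
# partition keeps inside its clusters. Random weights and cluster count are
# placeholders; a real study would compare against controls.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
W1 = np.abs(rng.normal(size=(20, 30)))       # layer 1 -> layer 2 weights
W2 = np.abs(rng.normal(size=(30, 10)))       # layer 2 -> layer 3 weights

n = 20 + 30 + 10
A = np.zeros((n, n))
A[0:20, 20:50] = W1                          # edges between adjacent layers
A[20:50, 50:60] = W2
A = A + A.T                                  # make the graph undirected

labels = SpectralClustering(
    n_clusters=4, affinity="precomputed", random_state=0
).fit_predict(A)

within = A[np.equal.outer(labels, labels)].sum()
print("within-cluster weight fraction:", within / A.sum())
```

Even if a network scores well on a measure like this, that only tells us it decomposes into modules, not that the modules are ones a human can understand.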
Under the hood, it’s all the same problem.
I agree it’s sort of the same problem under the hood, but I think knowing how you’re going to go from “understanding understanding” to producing an understandable model controls what type of understanding you’re looking for.
I also agree that this post makes ~0 progress on solving the “hard problem” of transparency; I just think it provides a potentially useful framing and creates a reference for me/others to link to in the future.