This post seems to me to be beating around the bush. There are several different classes of transparency methods evaluated by several different proxy criteria, but this is all sort of tangential to the thing which actually matters: we do not understand what “understanding a model” means, at least not in a sufficiently-robust-and-legible way to confidently put optimization pressure on it.
For transparency via inspection, the problem is that we don’t know what kind of “understanding” is required to rule out bad behavior. We can notice that some low-level features implement a certain kind of filter, and that is definitely a valuable clue to how these things work, but we probably wouldn’t be able to tell if those filters were wired together in a way which implemented a search algorithm.
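To make the gap concrete, here is a minimal sketch of the kind of inspection that is feasible today: pull out the first-layer convolution filters of a pretrained network and flag the ones that look like oriented edge detectors. The model choice and the “looks like an edge detector” heuristic are my own illustrative assumptions, and nothing in the sketch tells you how those filters get composed downstream.

```python
# Hypothetical sketch: inspect first-layer conv filters of a pretrained ResNet.
# The heuristic below (variance along one spatial axis vs. the other) is a
# crude, made-up proxy for "looks like an oriented edge detector".
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
filters = model.conv1.weight.detach()        # shape: (64, 3, 7, 7)

for i, f in enumerate(filters):
    g = f.mean(dim=0)                        # collapse RGB channels -> (7, 7)
    row_var = g.var(dim=1).mean().item()     # variation across columns
    col_var = g.var(dim=0).mean().item()     # variation across rows
    if max(row_var, col_var) > 3 * min(row_var, col_var):
        print(f"filter {i}: plausibly an oriented edge/line detector")
```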
For transparency via training, the problem is that we don’t know what metric measures the “understandability” of the system in a way sufficient to rule out bad behavior. The tree regularization example is a good one—it is not the right metric, and we don’t know what to replace it with. Use a proxy, get goodharted.
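For concreteness, here is a toy version of that kind of proxy (not the actual tree regularization procedure from Wu et al., which uses a differentiable surrogate; the model, task, and metric below are all placeholders): distill a trained network into a decision tree and treat the tree’s size as the “understandability” score. Anything optimized against that single number is free to game it.

```python
# Toy illustration of a tree-based "understandability" proxy. Model, task,
# and metric are made up; the point is only that the proxy is a single
# optimizable number, which is exactly what invites Goodharting.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # toy labeling rule

net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=1000).fit(X, y)

# Proxy metric: how small a decision tree can mimic the net on fresh inputs?
X_probe = rng.normal(size=(2000, 10))
surrogate = DecisionTreeClassifier().fit(X_probe, net.predict(X_probe))
print("surrogate tree node count:", surrogate.tree_.node_count)
# A small surrogate tree *suggests* simple behavior, but a model optimized
# against this number is only guaranteed to make the proxy look good.
```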
For transparency via architecture, the problem is that we don’t know what architectural features allow “understanding” of the kind needed to rule out bad behavior. Does clusterability actually matter? No idea.
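For reference, “clusterability” here is something like the graph-partitioning measure studied by Filan et al. A rough sketch of that kind of measurement, using toy random weights and a simplified within-cluster weight fraction rather than the paper’s normalized-cut statistic, looks like this:

```python
# Rough sketch of a clusterability measurement: treat |weights| as an
# undirected graph over all neurons and check how much edge weight a spectral
# partition keeps inside its clusters. Random weights and cluster count are
# placeholders; a real study would compare against controls.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
W1 = np.abs(rng.normal(size=(20, 30)))       # layer 1 -> layer 2 weights
W2 = np.abs(rng.normal(size=(30, 10)))       # layer 2 -> layer 3 weights

n = 20 + 30 + 10
A = np.zeros((n, n))
A[0:20, 20:50] = W1                          # edges between adjacent layers
A[20:50, 50:60] = W2
A = A + A.T                                  # make the graph undirected

labels = SpectralClustering(
    n_clusters=4, affinity="precomputed", random_state=0
).fit_predict(A)

within = A[np.equal.outer(labels, labels)].sum()
print("within-cluster weight fraction:", within / A.sum())
```

Even if a network scores well on a measure like this, that only tells us it decomposes into modules, not that the modules are ones a human can understand.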
Under the hood, it’s all the same problem.
I agree it’s sort of the same problem under the hood, but I think knowing how you’re going to go from “understanding understanding” to producing an understandable model controls what type of understanding you’re looking for.
I also agree that this post makes ~0 progress on solving the “hard problem” of transparency; I just think it provides a potentially useful framing and creates a reference for me/others to link to in the future.