Ideally, that “human interpretable” representation would itself be something mathematical, rather than just e.g. natural language, since mathematical representations (broadly interpreted, so including e.g. Python) are basically the only representations which enable robust engineering in practice.
That side of the problem, i.e. what the “human interpretable” target should look like once neural-net concepts are translated into something human interpretable, is also a major subproblem of “understanding abstraction”.
The tractability of this decomposition (human language → intermediate formalizable representation; LM representations → intermediate formalizable representation) seems bad to me, perhaps even less tractable than e.g. enumerative mech interp proposals. I’m not even sure I can picture where one would start to e.g. represent helpfulness in Python; it seems kinda GOFAI-complete. I’m also unsure why I should trust this kind of methodology more than e.g. direct brain-LM comparisons.
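To make the “GOFAI-complete” worry concrete, here’s a deliberately naive toy sketch (my own illustration, not anyone’s actual proposal) of what trying to pin down “helpfulness” in Python might look like. The predicates (`answers_question`, `is_truthful`, `respects_request`) are placeholders I made up; the point is that each of them is itself a fuzzy concept needing the same formalization effort, so the regress never bottoms out in anything precise.

```python
# Deliberately naive toy sketch (my own illustration): a GOFAI-style
# attempt to encode "helpfulness" as hand-written rules in Python.
from dataclasses import dataclass


@dataclass
class Response:
    # Each field is itself an informal concept that would need the same
    # formalization effort -- this is where the regress shows up.
    answers_question: bool
    is_truthful: bool
    respects_request: bool


def is_helpful(response: Response) -> bool:
    """Hand-written rule for 'helpfulness'; brittle by construction."""
    return (
        response.answers_question
        and response.is_truthful
        and response.respects_request
    )


# Edge cases immediately strain the rule: a truthful non-answer, a harmful
# request that *shouldn't* be respected, a partially correct answer, etc.
print(is_helpful(Response(True, True, True)))   # True
print(is_helpful(Response(True, False, True)))  # False -- but is a sincere
                                                # best-effort answer unhelpful?
```

Obviously nobody would propose exactly this, but every way I can picture making the leaf predicates precise runs into the same problem one level down, which is what makes it feel GOFAI-complete rather than merely hard.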
That’s a really good point. I would like to see John address it, because it seems quite crucial for the overall alignment plan.