Ideally, that “human interpretable” representation would itself be something mathematical, rather than just e.g. natural language, since mathematical representations (broadly interpreted, so including e.g. Python) are basically the only representations which enable robust engineering in practice.
That side of the problem, i.e. what the “human interpretable” target should look like once neural-net concepts are translated into something human interpretable, is also a major subproblem of “understanding abstraction”.
The tractability of this decomposition (human language → intermediate formalizable representation; LM representations → intermediate formalizable representation) seems bad to me, perhaps even less tractable than e.g. enumerative mech interp proposals. I’m not even sure I can picture where one would start to e.g. represent helpfulness in Python; it seems kinda GOFAI-complete. I’m also unsure why I should trust this kind of methodology more than e.g. direct brain-LM comparisons.
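To make the “GOFAI-complete” worry concrete, here’s a deliberately naive toy sketch (my own illustration, not anyone’s actual proposal) of what trying to pin down “helpfulness” in Python might look like. The predicates (`answers_question`, `is_truthful`, `respects_request`) are placeholders I made up; the point is that each of them is itself a fuzzy concept needing the same formalization effort, so the regress never bottoms out in anything precise.

```python
# Deliberately naive toy sketch (my own illustration): a GOFAI-style
# attempt to encode "helpfulness" as hand-written rules in Python.
from dataclasses import dataclass


@dataclass
class Response:
    # Each field is itself an informal concept that would need the same
    # formalization effort -- this is where the regress shows up.
    answers_question: bool
    is_truthful: bool
    respects_request: bool


def is_helpful(response: Response) -> bool:
    """Hand-written rule for 'helpfulness'; brittle by construction."""
    return (
        response.answers_question
        and response.is_truthful
        and response.respects_request
    )


# Edge cases immediately strain the rule: a truthful non-answer, a harmful
# request that *shouldn't* be respected, a partially correct answer, etc.
print(is_helpful(Response(True, True, True)))   # True
print(is_helpful(Response(True, False, True)))  # False -- but is a sincere
                                                # best-effort answer unhelpful?
```

Obviously nobody would propose exactly this, but every way I can picture making the leaf predicates precise runs into the same problem one level down, which is what makes it feel GOFAI-complete rather than merely hard.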
That’s a really good point. I would like to see John address it, because it seems quite crucial for the overall alignment plan.