Caspar Oesterheld comments on Using (Uninterpretable) LLMs to Generate Interpretable AI Code

Caspar Oesterheld 18 Sep 2024 17:22 UTC

I also don’t think that operations of the form “do X, because on average, this works well” necessarily are problematic, provided that “X” itself can be understood.

Yeah, I think I agree with this, and in general with what you say in this paragraph. Along the lines of your footnote, I’m still not quite sure what exactly “X can be understood” should require. It seems to matter, for example, that a human can understand how the given rule/heuristic, or something like it, could be useful. At least if we specifically think about AI risk, all we really need is that X is interpretable enough that we can tell that it’s not doing anything problematic (?).