beren comments on Using (Uninterpretable) LLMs to Generate Interpretable AI Code

beren 2 Jul 2023 23:51 UTC
2 points
0
This is obviously true; any AI-complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn’t really a problem for the proposal here. The point isn’t that we could avoid making AGI by doing this, the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability.
Maybe I’m being silly, but then I don’t understand the safety properties of this approach. If we need an AGI based on uninterpretable DL to build this, then how do we first check whether this AGI is safe?
- Joar Skalse 8 Jul 2023 8:44 UTC
  3 points
  2
  Parent
  The point is that you (in theory) don’t need to know whether or not the uninterpretable AGI is safe, if you are able to independently verify its output (similarly to how I can trust a mathematical proof, without trusting the mathematician).
  Of course, in practice, the uninterpretable AGI presumably needs to be reasonably aligned for this to work. You must at the very least be able to motivate it to write code for you, without hiding any trojans or backdoors that you are not able to detect.
  However, I think that this is likely to be much easier than solving the full alignment problem for sovereign agents. Writing software is a myopic task that can be accomplished without persistent, agentic preferences, which means that the base system could be much more tool-like than the system which it produces.
  
  But regardless of that point, many arguments for why interpretability research will be helpful also apply to the strategy I outline above.