As I see it, there are two fundamental problems here:
1.) Generating interpretable expert-system code for an AGI is probably already AGI-complete. It seems unlikely that a non-AGI DL model could output code for an AGI, especially given that it is highly unlikely that there would be expert-system AGIs in its training set (or even anything close to expert-system AGIs, if deep learning keeps far outpacing GOFAI techniques).
2.) Building an interpretable expert-system AGI is likely not just AGI-complete but a fundamentally much harder problem than building a DL AGI system. Intelligence is extremely detailed, messy, and highly heuristic. All our examples of intelligent behaviour come from large blobs of optimized compute (both brains and DL systems) and none from expert systems. The actual inner workings of intelligence might just be fundamentally uninterpretable in their complexity except at a high level, i.e. ‘this optimized blob is the output of approximate Bayesian inference over this extremely massive dataset’.
This is obviously true; any AI complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn’t really a problem for the proposal here. The point isn’t that we could avoid making AGI by doing this, the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability.
This is probably true, but the extent to which it is true is unclear. Moreover, if the inner workings of intelligence are fundamentally uninterpretable, then strong interpretability must also fail. I already commented on this in the last two paragraphs of the top-level post.
Maybe I’m being silly but then I don’t understand the safety properties of this approach. If we need an AGI based on uninterpretable DL to build this, then how do we first check if this AGI is safe?
The point is that you (in theory) don’t need to know whether or not the uninterpretable AGI is safe, if you are able to independently verify its output (similarly to how I can trust a mathematical proof, without trusting the mathematician).
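This verify-without-trusting asymmetry is the same one that shows up in NP-style problems: checking a claimed solution can be far cheaper than finding one. As a toy illustration (a hypothetical sketch of my own, not something from the discussion), a SAT assignment can be verified in linear time regardless of how, or by whom, it was produced:

```python
def verify_assignment(clauses, assignment):
    """Check a claimed SAT solution without trusting whoever produced it.

    clauses: list of clauses, each a list of signed ints
             (positive literal = variable is true, negative = false).
    assignment: dict mapping variable number -> bool.
    Returns True iff every clause contains at least one satisfied literal.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

# (x1 OR NOT x2) AND (x2 OR x3)
clauses = [[1, -2], [2, 3]]
print(verify_assignment(clauses, {1: True, 2: False, 3: True}))    # True
print(verify_assignment(clauses, {1: False, 2: False, 3: False}))  # False
```

Finding a satisfying assignment is NP-hard in general, but this checker runs in time linear in the formula size, which is the sense in which the verifier need not be as capable (or as trusted) as the generator.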
Of course, in practice, the uninterpretable AGI presumably needs to be reasonably aligned for this to work. You must at the very least be able to motivate it to write code for you, without hiding any trojans or backdoors that you are not able to detect.
However, I think that this is likely to be much easier than solving the full alignment problem for sovereign agents. Writing software is a myopic task that can be accomplished without persistent, agentic preferences, which means that the base system could be much more tool-like than the system which it produces.
But regardless of that point, many arguments for why interpretability research will be helpful also apply to the strategy I outline above.