Values at compile time
A putative new idea for AI control; index here.
This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI’s infrastructure.
It’s almost trivially simple. Have the AI construct a module that models humans and models human understanding (including natural language understanding). This is the kind of thing that any AI would want to do, whatever its goals were.
Then transfer that module (using corrigibility) into another AI, and use it as part of the definition of the new AI’s motivation. The new AI will then use this module to follow instructions humans give it in natural language.
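As a minimal, hypothetical sketch of that structure (the names HumanModel, extract_module and choose_action, and the scoring rule, are illustrative assumptions rather than anything defined in this post):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HumanModel:
    """Stands in for the module the first AI builds: a model of humans and
    of human (natural-language) understanding."""
    # interpret(instruction) returns a scoring function over candidate
    # actions: how well an action matches what the humans meant.
    interpret: Callable[[str], Callable[[str], float]]


def extract_module(first_ai) -> HumanModel:
    """Placeholder for lifting the human-modelling module out of the first
    AI; this is the step where corrigibility is assumed to do the work."""
    raise NotImplementedError


def choose_action(model: HumanModel, instruction: str,
                  candidates: List[str]) -> str:
    """The second AI's motivation, defined via the extracted module: pick
    the candidate action the human-model rates as best matching the
    instruction."""
    score = model.interpret(instruction)
    return max(candidates, key=score)
```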
Too easy?...
This approach would essentially solve the whole friendly AI problem, loading it onto the AI in a way that avoids the “defining goals (or meta-goals, or meta-meta-goals) in machine code” and “grounding everything in code” problems. As such it is extremely seductive, and will sound better, and easier, than it likely is.
I expect this approach to fail. For it to have any chance of success, we need to be sure that both model-as-definition and the intelligence module idea are rigorously defined. Then we need a good understanding of the various ways the approach might fail, before we can even begin to talk about how it might succeed.
The first issue that springs to mind is what happens when multiple definitions fit the AI’s model of human intentions and understanding. We might want the AI to try to accomplish what it is asked to do according to all of those definitions. Therefore, similarly to this post, we want to phrase the instructions carefully, so that a “bad instantiation” simply means the AI does something pointless, rather than something negative. E.g. “Give humans something nice” seems much safer than “give humans what they really want”.
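One way to make that “pointless rather than negative” desideratum concrete, as a hedged sketch: require every interpretation that fits the model to sign off on an action before it is taken, and otherwise do nothing. The names and the harm threshold below are illustrative assumptions, not part of the proposal.

```python
from typing import Callable, List, Optional

# One reading of the instruction: maps a candidate action to how good it is.
Interpretation = Callable[[str], float]


def cautious_choice(interpretations: List[Interpretation],
                    candidates: List[str],
                    harm_threshold: float = 0.0) -> Optional[str]:
    """Pick the action with the best worst-case score across every reading
    of the instruction, but only if no reading rates it as harmful;
    otherwise return None, so a bad instantiation collapses to inaction."""
    def worst_case(action: str) -> float:
        return min(score(action) for score in interpretations)

    safe = [a for a in candidates if worst_case(a) >= harm_threshold]
    if not safe:
        return None
    return max(safe, key=worst_case)
```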
And then, of course, there are the orders where humans really don’t understand what they themselves want...
I’d want a lot more issues like that discussed and solved, before I’d recommend using this approach to get a safe FAI.