The “defining goals (or meta-goals, or meta-meta-goals) in machine code” problem, or the “grounding everything in code” problem.
An AI that is superintelligent will “know what I mean” when I tell it to do something. The difficulty is specifying the AI’s goals (at compile time / in machine code) so that the AI “wants” to do what I mean.
Solving the “specify the correct goals in machine code” problem is thus necessary and sufficient for making a friendly AI. A lot of my arguments depend on this claim.
How to specify goals at compile time is a technical question, but we can do some a priori theorizing about how we might do it. Roughly, there are two high-level approaches: simple hard-coded goals, and goals fed in from more complex modules. A simple hard-coded goal might look like current reinforcement learners, where the reward signal is human praise (or a simple-to-hard-code proxy for human praise, such as presses of a reward button). The other alternative is to build a few modules (e.g. one for natural language understanding, one for modeling humans) and “use it/them as part of the definition of the new AI’s motivation.”
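To make the contrast concrete, here is a minimal sketch. Everything in it is invented for illustration (the toy world-state dictionary, the module classes, the function names); it is not a description of any real system, just a picture of where the goal content lives in each approach.

```python
# Hypothetical sketch; all names are invented to make the contrast concrete.

# Approach 1: a simple hard-coded goal. The entire goal content is one line:
# "reward = number of reward-button presses observed so far".
def hard_coded_reward(world_state: dict) -> float:
    return float(world_state.get("button_presses", 0))

# Approach 2: the goal is defined in terms of richer modules built separately
# (natural language understanding, a model of humans) and fed into the
# motivation system at "compile time".
class NaturalLanguageModule:
    def interpret(self, instruction: str):
        """Map an English instruction to a structured intent (assumed to work)."""
        raise NotImplementedError  # the hard, unsolved part

class HumanModelModule:
    def satisfaction(self, intent, outcome) -> float:
        """Score how well an outcome matches what the human actually meant."""
        raise NotImplementedError  # also the hard, unsolved part

def module_based_reward(nl: NaturalLanguageModule, hm: HumanModelModule,
                        instruction: str, outcome) -> float:
    intent = nl.interpret(instruction)
    # The motivation system defers to the human model's judgment instead of a
    # hand-coded proxy like button presses.
    return hm.satisfaction(intent, outcome)
```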
Responses to counterarguments:
4.1: needing to specify commands carefully (e.g. “give humans what they really want”).
And then of course there are those orders where humans really don’t understand what they themselves want...
The whole point of intelligence is being able to handle tasks specified in an ambiguous way (e.g. you don’t have to spell out what you want in such detail that you’re practically programming a computer). An AI that actually wants to make you happier (since its goals were specified at compile time using a module that models humans) will ask you to clarify your intentions if you give it vague goals.
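A minimal sketch of that clarification behavior, assuming a hypothetical agent whose human-model module reports a confidence score alongside its interpretation; the attribute names and the threshold are made up, not a real API:

```python
# Hypothetical sketch of "ask before acting on a vague goal". The agent object,
# its human_model attribute, and the 0.8 threshold are invented for illustration.

CONFIDENCE_THRESHOLD = 0.8

def act_on_instruction(agent, instruction: str):
    intent, confidence = agent.human_model.interpret(instruction)
    if confidence < CONFIDENCE_THRESHOLD:
        # Because the goal system already values doing what the human meant,
        # asking a clarifying question scores better than guessing and acting.
        question = agent.human_model.clarifying_question(intent)
        return agent.ask(question)
    return agent.execute(intent)
```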
Some other thoughts:
For this approach to have any chance of success, we need to be sure that both the model-as-definition idea and the intelligence-module idea are rigorously defined.
It will be hard to accomplish this, since nobody knows how to go about building such modules. Modeling language, humans, and human values are hard problems. Building the modules is a technical question. But it is necessary and sufficient to build the modules and feed them into the goal system of another AI in order to build a friendly AI. In fact, one could make a stronger argument that any AGI built with a goal system must have its goal system specified with natural-language modules (e.g. because reinforcement learning sucks). Thus, it is likely that any AGIs that get built would be FAIs.
EDITED to add: Tool-AI arguments.
If you can build the modules to feed into an AI with a goal system, then you might be able to build a “tool-AI” that doesn’t have a goal system at all. I think it’s hard to say a priori that such an architecture isn’t more likely than an architecture that requires a goal system. It’s even harder to say that a tool-AI architecture is impossible to build.
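Roughly, the architectural difference I have in mind looks like the following sketch (the planner, goal, and actuator objects are hypothetical placeholders): the tool variant only reports a plan back to a human, while the agent variant needs a goal fed into it because it selects and executes actions itself.

```python
# Hypothetical sketch of the tool-AI vs. goal-directed-agent distinction.
# Planner, search(), best_action(), and actuators are invented names.

class ToolAI:
    """Answers questions and proposes plans, but never acts on the world itself."""
    def __init__(self, planner):
        self.planner = planner

    def propose_plan(self, question: str):
        plan = self.planner.search(question)
        return plan  # a report for humans to inspect; nothing gets executed

class AgentAI:
    """Has a goal system: it selects and executes actions to optimize that goal."""
    def __init__(self, planner, goal, actuators):
        self.planner = planner
        self.goal = goal
        self.actuators = actuators

    def step(self, observation):
        action = self.planner.best_action(observation, self.goal)
        self.actuators.execute(action)  # acts directly, with no human in the loop
```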
In summary, I think the chief issues with building friendly AI are technical issues related to actually building the AI. I don’t see how decision theory helps. I do think that unfriendly humans with a tool-AI are something to be concerned about, but doing math research doesn’t seem related to that. (Incidentally, MIRI’s math research has intrigued people like Elon Musk, which helps with the “unfriendly humans” problem.)