I just realized I misread your above comment and was arguing against the wrong thing somewhat.
it seems likely that “the AI would be very familiar with humans and would have a good idea of actions that would meet human approval.”
Yes, the AI would know what we would approve of. It might also know what we want (note that these are different things). But it doesn’t have any reason to care.
At any given point, the AI needs to have a well-specified utility function. Or at least something like one. That gives the AI a goal it can optimize for.
With my method, the AI needs to do several things. It needs to predict what a human judge would do after reading some output it produces, i.e. whether they would hit a big button that says “Approve”. It needs to predict what AI 2 will say after reading its output, i.e. what probability AI 2 will assign to AI 1’s output being human. And it needs to predict what actions will increase the probability of those things, and take them. AI 2, in turn, only needs to predict one thing: how likely its input was to have been produced by a human.
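To make that concrete, here is a rough sketch (in Python) of what AI 1 would be optimizing. The two helpers are hypothetical stand-ins that are not part of the proposal as written: prob_approve is AI 1’s model of the human judge, and prob_human is AI 1’s model of AI 2’s human-vs-AI judgment. Combining them by simple multiplication is just one illustrative choice.

def ai1_utility(output, prob_approve, prob_human):
    # prob_approve(output): predicted chance the human judge hits the "Approve" button.
    # prob_human(output):   probability AI 1 expects AI 2 to assign to the output being human-written.
    # Multiplying the two is one simple way to reward outputs that score well on both.
    return prob_approve(output) * prob_human(output)

AI 1 would then search over candidate outputs (or actions) for whatever maximizes this score.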
How do you create a well-specified utility function for doing things humans would approve of? You just have it optimize the probability that the human will press the button that says “Approve”, and ditch the part about it pretending to be human.
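As a sketch, that simplified utility is just the first factor from the function above, with prob_approve the same hypothetical judge model as before.

def approval_only_utility(output, prob_approve):
    # Optimize nothing but the predicted probability that the judge presses "Approve".
    return prob_approve(output)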
But the output most likely to make you hit the approve button isn’t necessarily what you really want! It might be full of lies and manipulation, or a way to trick you.
And if you go further than that, and put it in an actual robot instead of a box, there’s nothing stopping it from stealing the approve button and pressing it endlessly. Or just hacking its own computer brain and setting reward equal to +INF (after which its behavior in the world is entirely undefined and unpredictable, and possibly dangerous).
There’s no way to specify “do what I want you to do” as a utility function. Instead we need to come up with clever ways to contain the AI and restrain its power, so we can use it to do useful work.
How does the second AI determine that AlphaGo is within or outside of human inventiveness at that time?
It could look at the existing research on Go playing or neural networks. AlphaGo doesn’t use any radically new methods and was well within the ability of humans. In fact, I predicted last year that Go would be beaten by the end of 2015, after reading some papers in 2014 that showed really promising results.
Okay, to simplify, suppose the AI has a function …
Boolean humankind_approves(Outcome o)
… that returns true when humankind would approve of a particular outcome o, and false otherwise.
At any given point, the AI needs to have a well-specified utility function.
Okay, to simplify, suppose the AI has a function …
Outcome U(Input i)
… which returns the outcome(s) (e.g., answer, plan) that optimizes expected utility given the input i.
But it doesn’t have any reason to care.
Assuming the AI is corrigible (I think we all agree that if the AI is not corrigible, it shouldn’t be turned on), we modify its utility function to U’ where
U’(i) = U(i) when humankind_approves(U(i)) is true, or null if no outcome U can produce satisfies humankind_approves
I suggest that an AI with utility function U’ is a friendly AI.
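To make the definition concrete, here is a minimal sketch in Python, treating U as returning a single best outcome for input i and “null” as returning nothing; U and humankind_approves are the hypothetical functions sketched above.

def U_prime(i):
    # Return U's preferred outcome only if humankind would approve of it.
    outcome = U(i)
    if humankind_approves(outcome):
        return outcome
    # No approved outcome exists: the AI produces nothing rather than act.
    return None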
It could look at the existing research
I think extrapolation from existing research is an interesting area of study, but I was attempting to evoke the surprise of a breakthrough invention. To me, the most interesting inventions are exactly those inventions that are not mundane extrapolations of existing techniques.