It’s easy to detect what solutions a human couldn’t have invented. That’s what the second AI does: it predicts how likely it is that an input was produced by a human rather than an AI. If it’s very unlikely that a human produced it, it can be discarded as “unsafe”.
However, it’s hard to know what a human would “approve” of, since humans can be tricked, manipulated, hacked, intimidated, etc. That is the standard problem with oracles that I am trying to solve with this idea.
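As a rough sketch of the filtering scheme being described here, something like the following; the function names and the cutoff value are placeholders of mine, not part of the proposal.

# Minimal sketch of the filtering scheme described above. The names and the
# cutoff value are placeholders, not part of the original proposal.

HUMAN_THRESHOLD = 0.5  # assumed cutoff; no specific value is given in the post


def p_produced_by_human(output: str) -> float:
    """AI 2's only job: estimate the probability that a human produced `output`."""
    raise NotImplementedError  # stands in for the second AI's predictor


def filter_output(output: str):
    """Discard any output of AI 1 that looks too unlikely to be human-made."""
    if p_produced_by_human(output) < HUMAN_THRESHOLD:
        return None  # rejected as "unsafe"
    return output    # passed along to the human judge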
It’s easy to detect what solutions a human couldn’t have invented. That’s what the second AI does
I think, to make this detection, the second AI would have to maintain high resolution simulations of the world’s smartest people (if not the entire population), and basically ask the simulations to collaboratively come up with their best solutions to the problem.
Supposing that is the case, the second AI can be configured to maintain high resolution simulations of the entire population, and basically ask the simulations whether they collectively approve of a particular action.
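If such simulations were available, the approval query might look something like the sketch below; the SimulatedHuman class and the majority-style aggregation rule are assumptions of mine, used only to make the idea concrete.

# Hypothetical sketch of "simulate the population and ask whether it
# collectively approves". Nothing here claims such simulations are feasible.

from typing import Sequence


class SimulatedHuman:
    def approves(self, action: str) -> bool:
        """Would this simulated person approve of the proposed action?"""
        raise NotImplementedError


def collectively_approves(population: Sequence[SimulatedHuman], action: str,
                          required_fraction: float = 0.5) -> bool:
    """Poll every simulated person; require a chosen fraction to approve."""
    votes = [person.approves(action) for person in population]
    return sum(votes) / len(votes) >= required_fraction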
Is there a way to “detect what solutions a human couldn’t have invented” that doesn’t involve emulating humankind?
Emulating human brains is a rather convoluted solution to any problem. The AI would be very familiar with humans and would have a good idea of our abilities.
To give an analogy, imagine we were the superintelligent AIs, and we were trying to tell apart chimps from humans pretending to be chimps. Let’s say one of the chimps produces a tool as a solution to a problem. Our goal is to guess whether it was really made by a chimp or by a human impersonator.
You look at the tool. It’s a spear made from a sharp rock tied to a stick. You look closely at the cord attaching the rock, and notice it was tied nicely.
You know chimps don’t know anything about knotcraft, let alone making cord, so you reject it as probably made by a human.
Another tool comes to you, a spear made from steel, and you immediately reject it as far beyond the ability of the chimps.
The last tool you examine is just a stick that has been sharpened at the end a little. Not the greatest, but definitely within the ability of chimps to produce. You note that it was probably produced by a chimp and let it pass.
Emulating human brains is a rather convoluted solution to any problem.
Granted. In practice, it may be possible to represent aspects of humankind in a more compact form. But the point is that if …
The AI would be very familiar with humans and would have a good idea of our [inventive] abilities.
… then to me it seems likely that “the AI would be very familiar with humans and would have a good idea of actions that would meet human approval.”
Taking your analogy … if we can model chimp inventiveness to a useful degree, wouldn’t we also be able to model which human actions would earn chimp approval and disapproval? Couldn’t we build a chimp-friendly AI?
Consider a different scenario: a year ago, we asked the first AI to generate a Go-playing program that could beat a professional Go player. The first AI submits AlphaGo as its solution after one day of processing. How does the second AI determine whether AlphaGo is within or outside of human inventiveness at that time?
I just realized I misread your above comment and was arguing against the wrong thing somewhat.
it seems likely that “the AI would be very familiar with humans and would have a good idea of actions that would meet human approval.”
Yes, the AI would know what we would approve of. It might also know what we want (note that these are different things). But it doesn’t have any reason to care.
At any given point, the AI needs to have a well-specified utility function, or at least something like a utility function. That gives the AI a goal it can optimize for.
With my method, the AI needs to do several things. It needs to predict what a human judge would do after reading some output it produces, i.e. whether they would hit a big button that says “Approve”. It needs to be able to predict what AI 2 will say after reading its output, i.e. what probability AI 2 will assign to AI 1’s output being human. And it needs to predict what actions will increase the probability of both of those things, and take them. AI 2, in turn, just needs to predict one thing: how likely it is that its input was produced by a human.
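Read literally, AI 1 is juggling two predicted probabilities. One way to write that down is the sketch below; combining them as a simple product is my assumption, since the comment doesn’t say how the two predictions are traded off.

# Sketch of AI 1's search objective as described above. All functions are
# placeholders; the product in `score` is an assumed way of combining the
# two predictions.

def p_judge_approves(output: str) -> float:
    """AI 1's prediction that the human judge hits the "Approve" button."""
    raise NotImplementedError


def p_looks_human(output: str) -> float:
    """AI 1's prediction of the probability AI 2 will say a human wrote this."""
    raise NotImplementedError


def score(output: str) -> float:
    return p_judge_approves(output) * p_looks_human(output)


def choose_output(candidates):
    """AI 1 picks the candidate output that maximizes the combined score."""
    return max(candidates, key=score)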
How do you create a well-specified utility function for doing things humans would approve of? You just have it optimize the probability that the human will press the button that says “Approve”, and ditch the part about it pretending to be human.
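In terms of the sketch above, that amounts to dropping the p_looks_human factor and keeping only the approval term (again, just an illustration):

# Approval-only variant: nothing in this objective penalizes manipulative
# or inhuman outputs, which is the problem discussed next.

def p_judge_approves(output: str) -> float:
    """Same placeholder as before: predicted chance the judge presses "Approve"."""
    raise NotImplementedError


def naive_score(output: str) -> float:
    return p_judge_approves(output)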
But the output most likely to make you hit the approve button isn’t necessarily what you really want! It might be full of lies and manipulation, or a way to trick you.
And if you go further than that and put it in an actual robot instead of a box, there’s nothing stopping it from stealing the approve button and pressing it endlessly. Or just hacking its own computer brain and setting reward equal to +INF (after which its behavior in the world is entirely undefined, unpredictable, and possibly dangerous).
There’s no way to specify “do what I want you to do” as a utility function. Instead we need to come up with clever ways to contain the AI and restrain its power, so we can use it to do useful work.
How does the second AI determine whether AlphaGo is within or outside of human inventiveness at that time?
It could look at the existing research on Go playing or neural networks. AlphaGo doesn’t use any radically new methods and was well within the ability of humans. In fact, I predicted last year that Go would be beaten by the end of 2015, after reading some papers in 2014 showing really promising results.
Okay, to simplify, suppose the AI has a function …
Boolean humankind_approves(Outcome o)
… that returns true when humankind would approve of a particular outcome o, and false otherwise.
At any given point, the AI needs to have a well-specified utility function.
Okay, to simplify, suppose the AI has a function …
Outcome U(Input i)
… which returns the outcome(s) (e.g., answer, plan) that optimizes expected utility given the input i.
But it doesn’t have any reason to care.
Assuming the AI is corrigible (I think we all agree that if the AI is not corrigible, it shouldn’t be turned on), we modify its utility function to U’ where
U’(i) = U(i) when humankind_approves(U(i)) or null if there does not exist a U(i) such that humankind_approves(U(i))
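As a sketch, here is one way to read that definition; treating U as returning candidate outcomes in order of decreasing utility is my assumption, based on the “outcome(s)” wording above.

# One possible reading of U', sketched with placeholder types. `U` and
# `humankind_approves` stand in for the functions defined earlier.

from typing import Iterable, Optional


def humankind_approves(outcome) -> bool:
    """Placeholder for the approval check defined above."""
    raise NotImplementedError


def U(i) -> Iterable:
    """Placeholder: yields candidate outcomes, best first."""
    raise NotImplementedError


def U_prime(i) -> Optional[object]:
    """Return the best candidate that humankind approves of, or null (None) if none exists."""
    for outcome in U(i):
        if humankind_approves(outcome):
            return outcome
    return None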
I suggest that an AI with utility function U’ is a friendly AI.
It could look at the existing research
I think extrapolation from existing research is an interesting area of study, but I was attempting to evoke the surprise of a breakthrough invention. To me, the most interesting inventions are exactly those inventions that are not mundane extrapolations of existing techniques.