How to design utility functions for safe AIs?
Make a utility function which will only emit positive values if the AI is disabled at the moment the solution to your precise problem is found. Ensure that the utility function will emit smaller values for solutions which took longer. Ensure the function will emit higher values for worlds which are more similar to the world as it would have been without the AI interfering.
This will not create a friendly AI, but an AI which tries to minimize its interference with the world. Depending on the weights applied to the three parts, it might spontaneously deactivate, though.
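For concreteness, here is a minimal sketch of how such a three-part function might be combined. All names and weights are my own illustration, and the inputs ai_disabled and world_similarity are simply assumed to be well-defined, which is exactly what the reply below disputes.

```python
# Hypothetical sketch of the proposed three-part utility function.
# `solved`, `ai_disabled`, `elapsed_days` and `world_similarity` are
# assumed inputs; how to measure the last two is left completely open.
def utility(solved, ai_disabled, elapsed_days, world_similarity,
            w_speed=1.0, w_similarity=1.0):
    # Positive value only if the AI is disabled at the moment the
    # solution to the problem is found.
    if not (solved and ai_disabled):
        return 0.0
    # Smaller values for solutions that took longer, larger values for
    # worlds closer to the "no AI interference" counterfactual.
    return w_speed / (1.0 + elapsed_days) + w_similarity * world_similarity
```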
“AI is disabled” and “world more similar to the world as it would have been without the AI interfering” are both magical categories. Your qualitative ontology has big, block objects labeled “AI” and “world” and an arrow from “AI” to “world” that can be either present or absent. The real world is a borderless, continuous process of quantum fields in which shaking one electron affects another electron on the opposite side of the universe.
I understand the general point, but “AI is disabled” seems like a special case, in that an AI able to do any sort of reasoning about itself, allocate internal resources, etc. (I don’t know how necessary this is for it to do anything useful) will have to have concepts in its qualitative ontology of, or sufficient to define, its own disabled state – though perhaps not in a way easily available for framing a goal system (e.g. if it developed them itself, assuming it could build up to them in their absence), and probably complicated in some other ways that haven’t occurred to me in two minutes.
Suppose the AI builds devices in the environment, especially computational devices designed to offload cognitive labor. What do you want to happen when the AI is “switched off”? Hence, magical category.
Interesting, I didn’t think of this situation. How do you define “lack of intelligence” or “removal of the effect of intelligence” in the environment, so that an AI can implement that state? How is this state best achieved?
Once the system is established, the world will forever be determined by a specific goal system, even if the goal is for the world to appear, from a certain time onward, as if no AI is present. The best solution is for the AI to pretend not to be present, “pulling the planets along their elliptic orbits”.
D’oh. Yes, of course, that breaks it.
As an aside, “waiting for Eliezer to find a loophole” probably does not constitute a safe and effective means of testing AI utility functions. This is something we want provable from first principles, not “proven” by “well, I can’t think of a counterexample”.
Of course, hence “...and probably complicated in some other ways that haven’t occurred to me in two minutes.”
Right. I know you realize this, and the post was fine in the context of “random discussion on the internet”. However, if someone wants to actually, seriously specify a utility function for an AI, any description that starts with “here’s a high-level rule to avoid bad things” and then works from there looking for potential loopholes is deeply and fundamentally misguided, completely independently of the rule proposed.
This might make a fun game actually.
Player One sets up a hypothetical AI (utility function + restraints), and Player Two, or the rest of the group, must find a situation where a superintelligence with that utility function would lead to a dystopia. Here’s my entry:
1) Utility is obtained by ‘votes’. Every person can either give +1 utility or −1 utility to the AI each day. To avoid issues of uncertainty over what constitutes a person, before the AI is turned on everyone is given a voting button. Additional voting buttons that may be created provide no utility.
2) Strongly discount future votes, using U(t) = 1 / 2^t for the new utility, where t is time in days. This way a constant vote for all time yields a finite utility (see the sketch after this list).
3) The AI cannot take any action until one year after it has been turned on, except for text outputs to satisfy (4).
4) The AI has an oracle component. Any voter can ask how their life on a particular future day will vary from the present day. The AI must answer to the best of its abilities.
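As a sketch of how the discounting in (2) keeps the total finite (the function name and the list-of-daily-net-votes representation are mine, not part of the entry):

```python
# Sketch of the discounted voting utility from point (2): each day's
# net vote total is weighted by 1 / 2^t, with t in days since start.
def total_utility(votes):
    return sum(v / 2 ** t for t, v in enumerate(votes))

# A single voter pressing +1 every day contributes at most
# 1 + 1/2 + 1/4 + ... = 2, so N voters bound the total at 2 * N.
print(total_utility([1] * 50))  # already ~2.0 after 50 days
```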
I ask it a question as in (4), and it tells me “build this machine [specifications included] and you will get [something the AI believes it can tempt me with]”. It exerts its full ability to persuade me. Unknown to me, the machine that it wants me to build takes command of all the voting buttons and jams them into giving the AI +1 every day.
I don’t think Player One can win this game.
I should have been clearer about what I meant by an oracle. The only utility the AI takes into account when answering a question is accuracy. It must be truthful, and it only answers questions about differences between a voter’s days. A valid question could be something like “A year from now, if you do [horrendous act], would my voter button still vote you down?” If not, I’d vote the AI down for the whole year until it begins acting.
While I recognize the huge threat of unfriendly AI, I’m not convinced Player One can’t win. I’d like to see the game played on a wider scale (and perhaps formalized a bit) to explore the space more thoroughly. It might also help illuminate the risks of AI to people not yet convinced. Plus it’s just fun =D
You can’t get anything useful out of an AI directed to leave the future unchanged, as any useful thing it does will (indirectly) make an impact on the future, through its application by people. Trying to define what kind of impact really results from the intended useful thing produced by the AI brings you back to square one.
“Minimize” is not “reduce to zero”. If the weighting is correct, the optimal outcome might very well be just the solution to your problem and nothing else. Also, this gives you some room for experiments. Start with a function which only values non-interference, and then gradually restart the AI with functions which include ever larger weights for solution finding, until you arrive at the solution.
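A minimal sketch of that restart-and-reweight procedure, assuming a purely hypothetical run_ai interface that takes the two weights and reports whether a solution was found:

```python
# Hypothetical sketch of the proposed experiment: keep the weight on
# non-interference fixed and restart the AI with ever larger weights
# on solution finding until a solution appears.
def sweep_solution_weight(run_ai, weights=(0.0, 0.001, 0.01, 0.1, 1.0)):
    for w_solution in weights:
        result = run_ai(w_noninterference=1.0, w_solution=w_solution)
        if result.solved:
            return w_solution, result
    return None, None  # no weight in the sweep produced a solution
```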
Or until everyone is dead.
If the solution to your problem is only reachable by killing everybody, yes.
Um, then this is not a “safe” AI in any reasonable sense.
I claim that it is, as it is averse to killing people as a side effect. If your solution does not require killing people, it would not kill them.
Stop, and read The Hidden Complexity of Wishes again. To us, killing a person or lobotomizing them feels like a bigger change than (say) moving a pile of rock; but unless your AI already shares your values, you can’t guarantee it will see things the same way.
Your AI would achieve its goal in the first way it finds that matches all the explicit criteria, interpreted without your background assumptions on what makes for a ‘reasonable’ interpretation. Unless you’re sure you’ve ruled out every possible “creative” solution that happens to horrify you, this is not a safe plan.
If it can understand “I have had little effect on the world”, it can understand “I am doing good for humanity”. A “safe” utility function would be no easier and less desirable than a Friendly one.
No easier? There’s a lot of hidden content in “effect on the world”, but presumably not all of Fun Theory, the entire definition of “person”, etc. (or shorter descriptions that unfold into these things). An Oracle AI that worked for humans would probably work just as well for Babyeaters or Superhappies (in terms of not automatically destroying things they value; obviously, it’d make alien assumptions about cognitive style, concepts, etc.).
I agree with that much, but the question is whether there’s enough hidden content to force development of a general theory of “learning what the programmers actually meant” that would be sufficient unto full-scale FAI, or sufficient given 20% more work.
Does moving a few ounces of matter from one location to another count as a significant “effect on the world”?
Does it matter to you whether that matter is taken from 1) a vital component of the detonator on a bomb in a densely populated area or 2) the frontal lobe of your brain?
If it does matter to you, how do you propose to explain the difference to an AI?
In general, yes; you can and should be much more conservative here than would fully reflect your preferences, and give it a principle implying your (1) and (2) are both Very Bad.
But, the waste heat from its computation will move at least a few ounces of air.
Maybe you can get around this by having it not worry (so to speak) about effects other than through I/O, but this is unsafe if it can use channels you didn’t think of to deliberately influence the world. Certainly other problems, too – but (it seems to me) problems that have to be solved anyway to implement CEV, which is sort of a special case of Oracle AI.
Quite so. The waste heat, of course, has very little thermodynamically significant direct impact on the rest of the world—but by the same token, removing someone’s frontal lobe or not has a smaller, more indirect impact on the world than preventing the bomb from detonating or not.
Now, suppose the AI’s grasp of causal structure is sufficient that it will indeed only take actions that truly have minimal impact vs. nonaction; in this case it will be unable to communicate with humans in ways that are expected to result in significant changes to the human’s future behavior, making it a singularly useless oracle.
My intuition here is that the insight required for any specification of which causal results of action are acceptable is roughly equivalent to what is necessary to specify something like CEV (i.e., essentially what Warrigal said above), in that both require that the AI have, roughly speaking, the ability to figure out what people actually want, not what they say they want. If you’ve done it right, you don’t need additional safeguards such as preventing significant effects; if you’ve done it wrong, you’re probably screwed anyways.
That’s pretty condensed. One of my video/essays discusses the underlying idea. To quote:
“One thing that might help is to put the agent into a quiescent state before being switched off. In the quiescent state, utility depends on not taking any of its previous utility-producing actions. This helps to motivate the machine to ensure subcontractors and minions can be told to cease and desist. If the agent is doing nothing when it is switched off, hopefully, it will continue to do nothing.
Problems with the agent’s sense of identity can be partly addressed by making sure that it has a good sense of identity. If it makes minions, it should count them as somatic tissue, and ensure they are switched off as well. Subcontractors should not be “switched off”—but should be tracked and told to desist—and so on.”
http://alife.co.uk/essays/stopping_superintelligence/
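A rough sketch of the bookkeeping the quoted shutdown discipline implies (the agent interface here is entirely hypothetical):

```python
# Hypothetical sketch of the "quiescent before switch-off" procedure:
# minions the agent built are treated like its own somatic tissue and
# powered down; independent subcontractors are tracked and told to desist.
def enter_quiescent_state(agent):
    for minion in agent.minions:
        minion.switch_off()
    for contractor in agent.subcontractors:
        contractor.tell_to_desist()
    agent.stop_utility_producing_actions()
    # Only once nothing the agent set in motion is still running
    # does the agent itself get switched off.
    agent.switch_off()
```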
I haven’t thought this through very deeply, but couldn’t the working of the machine be bounded by hard safety constraints that the AI was not allowed to change, rather than trying to work safety into the overall utility function?
e.g. the AI is not allowed to construct more resources for itself. No matter how inefficient it may be, the AI has to ask a human for more hardware and wait for the hardware to be installed by humans (see the sketch below).
What I would want of a super-intelligent AI is more or less what I would want of a human who has power over me: don’t do things to me or my stuff without asking, don’t coerce me, don’t lie to me.
I don’t know how you would code all that, but if we can’t design simple constraints, we can’t correctly design more complex ones. I’m thinking layers of simple constraints would be safer than one unprovably-friendly utility function.
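To make one such layer concrete, here is a hedged sketch of the hardware rule mentioned above; the class and method names are hypothetical:

```python
# Hypothetical sketch of a hard constraint kept outside the utility
# function: the AI cannot provision hardware itself, it can only file
# a request and wait for humans to approve and install it.
class HardwareGate:
    def __init__(self):
        self.pending_requests = []

    def request_hardware(self, spec):
        # The request is merely recorded for human review.
        self.pending_requests.append(spec)
        return "pending human approval"

    # Deliberately, no method grants resources directly; installation
    # happens only through human action outside the system.
```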
There’s been a great deal of discussion here on such issues already; you might be interested in some of what’s been said. As for your initial point, I think you have a model of a ghost in the machine who you need to force to do your bidding; that’s not quite what happens.
Nope, I’m a software engineer, so I don’t have that particular magical model of how computer systems work.
But suppose you were designing even a conventional software system to run something less dangerous than a general AI, say a nuclear reactor. Would you have every valve and mechanism controlled only by the software, with no mechanical failsafes or manual overrides, trusting the software to have no deadly flaws?
Designing that software and proving that it would work correctly in every possible circumstance would be a rich and interesting research topic, but it would never be completed.
One difference with AI is that it is theoretically capable of analyzing your failsafes and overrides (and their associated hidden flaws) more thoroughly than you. Manual, physical overrides aren’t yet amenable to rigorous, formal analysis, but software is. If we employ a logic to prove constraints on the AI’s behavior, the AI shouldn’t be able to violate its constraints without basically exploiting an inconsistency in the logic, which seems far less likely than the case where, e.g., it finds a bug in the overrides or tricks the humans into sabotaging them.
What makes you think these constraints are at all simple?