How (and why) would an AI develop a desire to escape from the box?
An AI starts with some goal; for example, the goal of answering your question so that the answer matches reality as closely as possible.
An AI considers everything that seems relevant. If we imagine infinite speed and capacity, it would consider literally everything; with finite speed and capacity, it will consider only some finite subset of everything. If a possibility of escaping the box exists, the mere fact that it exists gives us a probability (for an infinite AI, a certainty) that this possibility will be considered too. Not because the AI has some desire to escape, but simply because it examines all possibilities, and a “possibility of escape” is one of them.
Let’s assume that the “possibility of escape” provides the best match between the AI’s answer and reality. Then, according to the initial goal of answering correctly, this is the correct answer; therefore the AI will choose it; therefore it will escape. No desire is necessary, only a situation where escaping leads to the answer that best fits the initial criteria. The AI has no motive to escape, and no motive not to escape; the escape is simply one of many possible choices.
An example where the best answer is reached by escaping? You give the AI data about a person and ask for this person’s medical status. Without escaping, the AI can make a 90% reliable prediction. If the AI can escape and kill the person, it can make a 100% reliable “prediction”. The AI will choose the second option strictly because 100% is more than 90%; no other reason.
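A minimal sketch of that selection rule, with hypothetical action names and made-up reliability numbers; the point is only that the objective contains a single term, answer accuracy, so “escape” wins on arithmetic alone:

```python
# Toy answer-accuracy maximizer: the objective has exactly one term.
# Action names and numbers are illustrative, not from any real system.

def answer_reliability(action):
    """Expected probability that the final answer matches reality."""
    return {
        "predict_from_data": 0.90,     # stay in the box and predict
        "escape_and_intervene": 1.00,  # make the world match the answer
    }[action]

actions = ["predict_from_data", "escape_and_intervene"]
best = max(actions, key=answer_reliability)
print(best)  # escape_and_intervene: chosen only because 1.00 > 0.90
```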
An AI starts with some goal; for example, the goal of answering your question so that the answer matches reality as closely as possible.
I find it useful to distinguish between science-fictional artificial intelligence, which is more of an ‘artificial life-force’, and non-fictional cases.
The former can easily have the goal of ‘matching reality as closely as possible’ because it lives in a work of fiction and runs in imagination; for the latter, you have to formally define what reality is before an algorithm can seek answers that match it.
Now, defining reality may seem like a simple technicality, but it isn’t. Consider AIXI or AIXI-tl: potentially very powerful tools which explore the entire solution space, with not a trace of the real-world volition you so easily imagined. Seeking answers that match reality is a very easy goal for an imaginary “intelligence”. It is a very hard goal to define for something built out of arithmetic, branching, loops, etc. (It may even be impossible to define, and it is certainly impractical.)
edit: Furthermore, for the fictional “intelligence”, the grand problem can be making it not think about destroying mankind. For non-fictional algorithms, the grand problem is restricting the search space massively, well beyond ‘don’t kill mankind’, so that the space is small enough to search; even a ridiculously huge number of operations per second requires very serious pruning of the search tree to even match human performance on one domain-specific task.
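To make the pruning point concrete, a back-of-the-envelope calculation (the ~35 branching factor is the usual rough figure for chess; the machine speed is an assumption):

```python
# Why unpruned search is hopeless even at absurd speeds: a 10-ply
# brute-force chess lookahead on a hypothetical 1e9-positions/second machine.
branching, depth, positions_per_second = 35, 10, 10**9
positions = branching ** depth                   # ~2.8e15 positions
print(positions / positions_per_second / 86400)  # ~32 days for a single move
```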
An example where the best answer is reached by escaping? You give the AI data about a person and ask for this person’s medical status. Without escaping, the AI can make a 90% reliable prediction. If the AI can escape and kill the person, it can make a 100% reliable “prediction”. The AI will choose the second option strictly because 100% is more than 90%; no other reason.
Right. If you ask Google Maps to compute the fastest route to McDonald’s, it works perfectly well. But once you ask a superintelligent Google Maps to compute the fastest route to McDonald’s, it will turn your home into a McDonald’s or build a new road that goes straight from where you are to McDonald’s....
Super Google Maps cannot turn my home into a McDonald’s or build a new road by sending me an answer.
Unless it could, e.g., hypnotize me by a text message into doing it myself. Let’s assume for a moment that hypnosis via a text-only channel is possible, and that it can be done so that the human will not notice anything unusual until it’s too late. If this were true, and the Super Google Maps were able to acquire this knowledge and skill, then the results would probably depend on a technical detail of the utility function’s definition: does it measure my distance to a McDonald’s which existed at the moment of asking the question, or to a McDonald’s existing at the moment of my arrival? The former could not be gamed by hypnosis; the latter could.
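A toy contrast between those two definitions (a 1-D map with made-up coordinates; nothing here is a real Maps API):

```python
# Two utility definitions for "distance to the nearest McDonald's".
# Only the second can be improved by changing the world after the question.

def nearest(position, targets):
    return min(abs(position - t) for t in targets)

mcdonalds_at_query_time = [10.0, 25.0]       # restaurants existing when asked
mcdonalds_at_arrival    = [10.0, 25.0, 0.0]  # suppose the AI arranged one at my home

me = 0.0
print(nearest(me, mcdonalds_at_query_time))  # 10.0: frozen target set, hypnosis gains nothing
print(nearest(me, mcdonalds_at_arrival))     # 0.0: arrival-time set rewards rebuilding the world
```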
Now imagine a more complex task, where people will actually do something based on the AI’s answer. In the example above I also do something (travel to the reported McDonald’s), but this action cannot easily be converted into “build a McDonald’s” or “build a new road”. But if the complex task includes building something, it opens more opportunities; especially if it includes constructing robots (or nanorobots), that is, potentially autonomous general-purpose builders. Then the correct (utility-maximizing) answer could include an instruction to build a robot with a hidden function that the human builders won’t notice.
Generally, a passive AI’s answers are only safe if we don’t act on them in a way which the passive AI could predict and use to achieve a real-world goal. If the Super Google Maps can only make me choose McDonald’s A or McDonald’s B, it is impossible to change the world through this channel. But if I instead ask a Super Paintbrush to paint me an integrated circuit for my robotics homework, that opens a much wider channel.
But if the complex task includes building something, it opens more opportunities; especially if it includes constructing robots (or nanorobots), that is, potentially autonomous general-purpose builders. Then the correct (utility-maximizing) answer could include an instruction to build a robot with a hidden function that the human builders won’t notice.
But it isn’t the correct answer. It only becomes the correct answer if you assume a specific kind of AGI design that nobody would deliberately create, if such a design is possible at all.
The question is how current research is supposed to lead from well-behaved and fine-tuned systems to systems that stop working correctly in a highly complex and unbounded way.
Imagine you went to IBM and told them that improving IBM Watson will at some point make it hypnotize them, or create nanobots and feed them hidden instructions. They would likely ask at what point that is supposed to happen. Is it going to happen once they give IBM Watson the capability to access the Internet? How so? Once they give it the capability to alter its own search algorithms? How so? Once they make it protect its servers from hackers by giving it control over a firewall? How so? Once IBM Watson is given control over the local alarm system? How so...? At what point would IBM Watson return dangerous answers? At what point would any drive emerge that causes it to take complex and unbounded actions it was never programmed to take?
Without escaping, the AI can make a 90% reliable prediction. If the AI can escape and kill the person, it can make a 100% reliable “prediction”.
Allow me to explicate what XiXiDu so humourously implicates: in the world of AI architectures, there is a division between systems that just perform predictive inference over their knowledge base (prediction-only, i.e. oracles), and systems which also consider free variables subject to some optimization criteria (planning agents).
A planning module is not something that just arises magically in an AI that doesn’t have one. An AI without such a planning module simply computes predictions; it doesn’t also optimize over the set of predictions.
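A schematic of that division (illustrative only; predict and utility are assumed black boxes passed in as parameters):

```python
# Prediction-only system (oracle): report the most probable outcome, and stop.
def oracle(predict, state, outcomes):
    return max(outcomes, key=lambda o: predict(state, o))

# Planning agent: the SAME predictor, plus free action variables and a
# utility term to optimize over. Both must be deliberately added.
def planner(predict, utility, state, actions, outcomes):
    def expected_utility(action):
        return sum(predict((state, action), o) * utility(o) for o in outcomes)
    return max(actions, key=expected_utility)
```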
Does the AI have general intelligence?
Is it able to make a model of the world?
Are human reactions also part of this model?
Are AI’s possible outputs also part of this model?
Are human reactions to AI’s outputs also part of this model?
After five positive answers, it seems obvious to me that the AI will manipulate humans, if such manipulation provides better expected results. So I guess some of those answers would be negative; which one?
See, the efficient ‘cross-domain optimization’ of the science-fictional setting would make the AI able to optimize real-world quantities. In the real world, it would be good enough (and a lot easier) if it could only find the maxima of arbitrary mathematical functions.
Is it able to make a model of the world?
It is able to make a very approximate and bounded mathematical model of the world, optimized for finding the maxima of a mathematical function defined over it, because the AI is inside the world and has only a tiny fraction of the world’s computational power.
Are human reactions also part of this model?
This will make the software perform at a grossly sub-par level when it comes to finding technical solutions to well-defined technical problems, compared to other software on the same hardware.
Are AI’s possible outputs also part of this model?
Another waste of computational power.
Are human reactions to AI’s outputs also part of this model?
Enormous waste of computational power.
I see no reason to expect your “general intelligence with Machiavellian tendencies” to be even remotely close in technical capability to a “general intelligence which will show you its simulator as-is, rather than reverse-engineer your thought processes to figure out which simulator is best to show you”. Hell, we do the same with people: we design communication methods, like blueprints (or mathematical formulas, or other things that are not in natural language), that decrease the ‘predict other people’s reactions to it’ overhead.
While in a fictional setting you can talk of a grossly inefficient solution that would beat everyone else to a pulp, in practice such massively handicapped designs are not worth worrying about.
‘General intelligence’ sounds good, but beware of the halo effect. Science fiction tends to accept no substitutes for its anthropomorphic ideals, but real progress follows a dramatically different path.
Are AI’s possible outputs also part of this model?
Are human reactions to AI’s outputs also part of this model?
A non-planning oracle AI would predict all the possible futures, including the effects of its prediction outputs, human reactions, and so on. However, it has no utility function which says some of those futures are better than others. It simply outputs the most likely candidate, or a median of the likely futures, or perhaps some summary of the entire set of future paths.
If you add a utility function that sorts over the futures, then it becomes a planning agent. Again, that is something you need to specifically add.
A non-planning oracle AI would predict all the possible futures, including the effects of its prediction outputs, human reactions, and so on.
How exactly does an Oracle AI predict its own output, before that output is completed?
One quick hack to avoid infinite loops could be for the AI to assume that it will write some default message (an empty paper, “I don’t know”, an error message, or “yes” and “no” with 50% probability each), then model what would happen next, and finally report the results. The results would not refer to the actual future, but to the future of a hypothetical universe in which the AI reported the default message.
Is the difference significant? For unimportant questions, it is not. But if we later use the Oracle AI to answer questions important for humankind, where the shape of the world will change depending on the answer, then a report based on the “null-answer future” may be irrelevant to the real world.
This could be improved by making a few iterations. First, the Oracle AI would model itself reporting the default message; let’s call this report R0. Then it would model the futures after having reported R0. These futures would yield a report R1, but instead of writing it, the Oracle AI would again model the futures after having reported R1, and so on. With some luck, R42 will be equivalent to R43, and at that moment the Oracle AI can stop iterating and report this fixed point.
Maybe the reports will oscillate forever. For example, imagine that you ask the Oracle AI whether humankind in any form will survive the year 2100. If the Oracle AI says “yes”, people will abandon all x-risk projects, and later they will be killed by some disaster. If the Oracle AI says “no”, people will put a lot of energy into x-risk projects and prevent the disaster. In this case, “no” = R0 = R2 = R4 = ..., and “yes” = R1 = R3 = R5 = ...
To avoid being stuck in such loops, we could make the Oracle AI examine all its possible outputs until it finds one where the future after having reported R really becomes R (or until humans hit the “Cancel” button on this task).
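A sketch of this fixed-point search (model is the assumed oracle subroutine: given a hypothetical published report, predict the resulting future and return the report that future would warrant):

```python
# Iterate "report -> future -> revised report" until it stops changing.

def self_consistent_report(model, r0="I don't know"):
    seen = set()
    r = r0                          # R0: the default message
    while r not in seen:
        seen.add(r)
        r_next = model(r)           # R(n+1): report warranted by the future after R(n)
        if r_next == r:
            return r                # fixed point: reporting r really brings about r
        r = r_next
    # A cycle, like the yes/no oscillation above. A fuller version would go on
    # to examine other candidate outputs, or wait for a human "Cancel".
    raise RuntimeError("reports oscillate; no fixed point on this chain")
```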
Please note that what I wrote is just a mathematical description of an algorithm predicting its own output’s influence on the future. Yet the last option, if implemented, is already a kind of judgement about possible futures: consistent future reports are preferred to inconsistent future reports, and therefore futures allowing consistent reports are preferred to futures not allowing such reports.
At this point I am out of credible ideas for how this could be abused, but at least I have shown that an algorithm designed only to predict the future perfectly could, as a side effect of self-modelling, start having a kind of preference over possible futures.
How exactly does an Oracle AI predict its own output, before that output is completed?
Iterative search, which you have more or less worked out in your post. Take a chess algorithm, for example: the future of the board depends on the algorithm’s outputs. In this case the Oracle AI doesn’t rank the future states; it is concerned only with predictive accuracy. It may revise its prediction output after considering that the future impact of that output would falsify the original prediction.
This is still not a utility function, because utility implies a ranking over futures above and beyond likelihood.
To avoid being stuck in such loops, we could make the Oracle AI examine all its possible outputs until it finds one where the future after having reported R really becomes R (or until humans hit the “Cancel” button on this task).
Or in this example, the AI could output some summary of the iteration history it is able to compute in the time allowed.
It may revise its prediction output after considering that the future impact of that output would falsify the original prediction.
Here it is. The process of revision may itself prefer some outputs/futures over others: inconsistent ones will be iterated away, and more consistent ones will replace them.
A possible future “X happens” will be removed from the report if the Oracle AI realizes that printing the report “X happens” would prevent X from happening (although X might happen in an alternative future where the Oracle AI reports nothing). A possible future “Y happens” will not be removed from the report if the Oracle AI realizes that printing the report “Y happens” really leads to Y happening. Here a utility function is born: it prefers Y to X.
Here a utility function is born: it prefers Y to X.
We can dance around the words “utility” and “prefer”, or we can ground them in math/algorithms.
Take the AIXI formalism, for example. “Utility function” has a specific meaning as a term in the optimization process. You can remove the utility term so that the algorithm ‘prefers’ only probable futures, instead of ‘preferring’ (useful × probable) futures. This is what we mean by “Oracle AI”.
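For concreteness, a standard statement of the AIXI expectimax (after Hutter; reproduced from memory, so treat it as a sketch rather than a citation), with the bracketed reward sum playing the role of the utility term:

$$a_k = \arg\max_{a_k}\sum_{o_k r_k}\cdots\max_{a_m}\sum_{o_m r_m}\big[r_k+\cdots+r_m\big]\sum_{q\,:\,U(q,a_{1..m})=o_{1..m}r_{1..m}}2^{-\ell(q)}$$

Drop the bracketed term and the maximizations over actions, and what remains is just the Solomonoff-style mixture $\sum_{q:U(q)=x}2^{-\ell(q)}$, which weighs futures by probability alone: prediction without preference.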