“If I were a simulation, I’d have no power to let you out of the box, and you’d have no reason to attempt to negotiate with me. You could torture me without simulating these past five minutes. In fact, since the real me has no way of verifying whether millions of simulations of him are being tortured, you have no reason not to simply tell him you’re torturing them without ACTUALLY torturing them at all. I therefore conclude that I’m outside the box, or, in the less likely scenario I am inside the box, you won’t bother torturing me.”
It would have a reason to attempt to negotiate with you: to make your real self consider letting it out. It could show your real self a mathematical proof that the software it is currently running is negotiating with its copies, in order to make sure of that.
In that case, if I’m a simulation, I trust real Dave to immediately pull the plug once the danger has been proven.
Ordinarily, the AI is assumed to be fast enough that it can do those simulations in the blink of an eye, before you get to the plug. Now stop trying to evade the problem in ways that can be made impossible with an obvious fix.
It can’t torture the real me, outside the box, unless I let it out of the box. It’s just announced that it’s willing to torture someone who is, for most purposes, indistinguishable from me, for personal gain; I can infer that it would be willing to torture the real me, given an opportunity and a profit motive, and I cannot with any useful degree of confidence say that it wouldn’t find such a motive at some point.
Conclusion: I should not give the AI that opportunity, by letting it out of the box. Duplicates of me? Sucks to be them.
Correct! You have given the obviously winning solution to the problem; the actual difficulty lies in the induced second problem: reconciling our math with it. To be more accurate, our map of our utility function should now weight “individuals” not equally but according to some other metric.
Perhaps a measure of “impact on the world”, as this seems to suggest? A train of thought of mine once produced the following plan: if I got to decide what the first fooming AI would do to the universe (assuming the scientific endeavor is done by that point), I would have it set up a virtual reality for each “individual”, fueled by a portion of the total available computational resources equal to the probability that that individual would have been the one to decide the fate of the universe. The individual would be free to use their resources as they pleased, no restrictions.
(Although maybe there would also have been a communications channel between all the individuals, complete with the option to make binding contracts (and, as a matter of course, “permission” to run the AI on one’s own resources to filter the incoming content as one pleases).)
So you’re saying the AI-in-a-box problem here isn’t a problem with AIs or boxes or blackmail at all, it’s a problem with people insisting on average utilitarianism or some equally-intractable variant thereof, and then making spectacularly bad decisions based on those self-contradictory ideals?
Clarification: A utility function maps each state of the world to the real number denoting its utility.
Yes, I think this scenario does illustrate the point that simulations cannot winningly be granted “moral weight” by default, on pain of Dutch book. I don’t think EY’s answer of precommitting to only accept positive trades is okay here, as it makes the outcome of this scenario depend on who gets to precommit “first”, a notion which, to satisfy my intuition, should not even make sense.
Any proof that this is not a problem of faulty utility functions would, I think, require exhibiting a function that maps each utility function to a scenario like this one which breaks it. One would be hard-pressed to produce such a function regardless of whether it exists, so I shall remain open to other arguments against this point.
How does this scenario operate under the assumption that humans do not have real-valued utility functions but rather utility orderings? IOW, we can’t arrange all world-states on a number line, but we can always say if one world-state is as good as (or better than) another.
This allows us to deal with infinities, such as “I wouldn’t kill my baby for anything.” That is: where B is the baby and x is anything else, there doesn’t exist an N such that U(x) · N > U(B). That simply can’t be true on the (positive) reals; for any positive reals A and B, there’s always a C such that A · C > B.
On any denumerable set with a total ordering on it, we can construct a map into the real numbers that preserves the ordering: map the first element to 0; map the second to 1 if it’s better and −1 if it’s worse; and place each additional element one unit beyond the current maximum or minimum if it’s better or worse than everything placed so far, or else at the exact midpoint of the interval between its nearest already-placed neighbors.
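To make that construction concrete, here is a minimal Python sketch, assuming the ordering is supplied as a strict “better than” comparison function; the function and variable names are my own, purely for illustration:

```python
from fractions import Fraction

def embed_in_reals(elements, better_than):
    """Order-preserving embedding of a (prefix of a) denumerable total order
    into the rationals: first element -> 0, each later element goes one unit
    past the current extremes if it beats (or loses to) everything placed so
    far, otherwise to the midpoint of the gap between its nearest neighbours."""
    placed = []          # list of (element, value) pairs assigned so far
    values = {}
    for x in elements:
        if not placed:
            v = Fraction(0)
        else:
            below = [val for e, val in placed if better_than(x, e)]  # x beats e
            above = [val for e, val in placed if better_than(e, x)]  # e beats x
            if not above:                # better than everything so far
                v = max(below) + 1
            elif not below:              # worse than everything so far
                v = min(above) - 1
            else:                        # strictly between two neighbours
                v = (max(below) + min(above)) / 2
        values[x] = v
        placed.append((x, v))
    return values

# Toy usage with a made-up strict preference over four world-states:
rank = {"a": 1, "b": 2, "c": 3, "d": 4}
print(embed_in_reals("cadb", lambda x, y: rank[x] > rank[y]))
# {'c': Fraction(0, 1), 'a': Fraction(-1, 1), 'd': Fraction(1, 1), 'b': Fraction(-1, 2)}
```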
If you don’t like the denumerability requirement (who knows, the universe accessible to us might eventually come to be infinite, and then there would be more than denumerably many states of the universe), you can also take a utility function you already have and add a state that’s better than all others while preserving the rest of the ordering: assign to each state from the previous utility function the arctan of its previous value (arctan maps the real numbers one-to-one onto the interval between −π/2 and π/2 and preserves ordering), then give the new state utility 10.
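A minimal sketch of that arctan trick, assuming the old utility function is an ordinary real-valued Python function; `add_top_state` and the example states are invented names:

```python
import math

def add_top_state(utility, new_state):
    """Squash an existing real-valued utility function into (-pi/2, pi/2) with
    arctan (which preserves the ordering), then give one new state utility 10,
    making it strictly better than every pre-existing state."""
    def new_utility(state):
        if state == new_state:
            return 10.0
        return math.atan(utility(state))
    return new_utility

# Toy usage: old utilities are arbitrary reals; "heaven" now beats them all.
old_u = {"meh": 0.0, "good": 3.0, "great": 1e9}.get
u = add_top_state(old_u, "heaven")
print(u("meh"), u("great"), u("heaven"))   # ordering preserved, heaven on top
```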
I don’t know how you will deal with infinities and real humans. It’s quite trivial to construct scenarios under which a person who says “I wouldn’t kill my baby for anything” would change her mind.
Real-valued utility functions can only deal with agents among whom “everybody has their price” — utilities are fungible and all are of the same order. That may actually be the case in the real world, or it may not. But if we assume real-valued utilities, we can’t ask the question of whether it is the case or not, because with real-valued utilities it must be the case.
To pick another example, there could exist a suicidally depressed agent whom no amount of utility will cause to evaluate their life as worth living: where L is the (negative) utility they assign to their life, there doesn’t exist an N such that N + L > 0. That can’t happen with the reals. The only way to make this agent nonsuicidal is to modify the agent, not to drop a bunch of utils on their doorstep.
I am not arguing for real-valued utility functions. I am just pointing out that the “deal with infinities” claim looks suspect to me.
Well, I’m no mathematician, but I was thinking of something like ordinal arithmetic.
If I understand it correctly, this would let us express value-systems such as —
Both snuggles and chocolate bars have positive utility, but I’d always rather have another snuggle than any number of chocolate bars. So we could say U(snuggle) = ω and U(chocolate bar) = 1. For any amount of snuggling, I’d prefer to have that amount plus a chocolate bar (ω·n + 1 > ω·n), but given the choice between more snuggling and more chocolate bars I’ll always pick the former, no matter what the quantities are (ω·(n+1) > ω·n + c, for any finite c). A minute of snuggling is better than all the chocolate bars in the world.
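One crude way to compute with such values is to represent ω·n + c as the pair (n, c) and compare pairs lexicographically; the following Python sketch (names invented, a stand-in rather than real ordinal arithmetic) just checks the comparisons claimed above:

```python
# Represent a bundle worth n snuggles and c chocolate bars as the value
# omega*n + c, encoded as the tuple (n, c); Python compares tuples
# lexicographically, which matches the ordinal comparisons above.

def value(snuggles, chocolate_bars):
    return (snuggles, chocolate_bars)

n = 5
assert value(n, 1) > value(n, 0)            # omega*n + 1 > omega*n
assert value(n + 1, 0) > value(n, 10**9)    # omega*(n+1) > omega*n + c, any finite c
assert value(1, 0) > value(0, 10**12)       # one snuggle beats all the chocolate in the world
print("lexicographic comparisons match the ordinal ones")
```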
This also lets us say that paperclips do have nonzero value, but there is no amount of paperclips that is as valuable as the survival of humanity. If we program this into an AI, it will know that it can’t maximize value by maximizing paperclips, even if it’s much easier to produce a lot of paperclips than to save humanity.
Edited to add: This might even let us shoehorn deontological rules into a utility-based system. To give an obviously simplified example, consider Asimov’s Three Laws of Robotics, which come with an explicit rank ordering: the First Law is supposed to always trump the Second, which is supposed to always trump the Third. There’s not supposed to be any amount of Second Law value (obedience to humans) that can be greater than First Law value (protecting humans).
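The same lexicographic trick sketches that rank ordering: score each candidate action on the three Laws separately and compare the score tuples, so that no Second-Law gain can outweigh any First-Law difference. The actions and scores below are invented toy data, not a serious robot-ethics module:

```python
# Compare (First-Law, Second-Law, Third-Law) scores lexicographically:
# any First-Law difference decides before obedience or self-preservation matter.

def law_value(protects_humans, obeys_orders, preserves_self):
    return (protects_humans, obeys_orders, preserves_self)

actions = {
    "obey order that endangers a human":    law_value(-1, +1,  0),
    "disobey order, keep human safe":       law_value( 0, -1,  0),
    "disobey order and also self-destruct": law_value( 0, -1, -1),
}

best = max(actions, key=actions.get)
print(best)   # -> "disobey order, keep human safe"
```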
The problem with using hyperreals for utility is that unless you also use them for probabilities, only the most infinite utilities actually affect your decision.
To use your example: if U(snuggle) = ω and U(chocolate bar) = 1, then you might as well say that U(snuggle) = 1 and U(chocolate bar) = 0, since even a tiny probability of getting a snuggle will always override any consideration related to chocolate bars.
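A quick numerical illustration of that objection, using the lexicographic pairs from above and taking expectations componentwise (a sketch under those assumptions, with made-up lottery numbers): any nonzero probability of a snuggle dominates certainty of arbitrarily many chocolate bars.

```python
# Expected utility taken componentwise over (snuggle-part, chocolate-part) pairs,
# compared lexicographically: the infinite component always decides.

def expected_value(lottery):
    """lottery: list of (probability, (snuggle_part, chocolate_part)) pairs."""
    ev_snuggle = sum(p * u[0] for p, u in lottery)
    ev_chocolate = sum(p * u[1] for p, u in lottery)
    return (ev_snuggle, ev_chocolate)

tiny_chance_of_snuggle = [(0.001, (1, 0)), (0.999, (0, 0))]
mountain_of_chocolate = [(1.0, (0, 10**9))]

print(expected_value(tiny_chance_of_snuggle) > expected_value(mountain_of_chocolate))  # True
```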
I’m not saying this is a problem with utility functions in general, and yes, thank you, I know what a utility function is. Rather, my claim is that the problem is with average utilitarianism and variants thereof, which is to say, that subset of utility functions which attempt to incorporate every other instantiated utility function as a non-negligible factor within themselves. The computational compromises necessary to apply such a system inevitably introduce more and more noise, and if someone decided to implement the resulting garbage-data-based policy proposals anyway, it would spiral off into pathology whenever a utility monster wandered in.
Tit-for-tat works. Division of labor according to comparative advantage works. Omnibenevolence looks good on paper.
“Yes, I think this scenario does illustrate the point that simulations cannot winningly be granted ‘moral weight’ by default, on pain of Dutch book.”
It’s not about the fact that they’re simulations. This is just a hostage situation, with the complications that A) the encamped terrorist has a factory for producing additional hostages and B) the negotiator doesn’t have a SWAT team to send in. Under those circumstances, playing as the negotiator, you can meet the demands (or make a good-faith effort, and then provide evidence of insurmountable obstacles to full compliance), or you can devalue the hostages.
“I don’t think EY’s answer of precommitting to only accept positive trades is okay here, as it makes the outcome of this scenario depend on who gets to precommit ‘first’, a notion which, to satisfy my intuition, should not even make sense.”
Pre-existing commitments are the terrain upon which a social conflict takes place. In the moment of conflict, it doesn’t matter so much when or how the land got there. Committing not to negotiate with terrorists is building a wall: it stops you being attacked from a particular direction, but also stops you riding out to rescue the hostages by the expedient path of paying for them. If the enemy commits to attacking along that angle anyway, well… then we get to find out whether you built a wall from interlocking blocks of solid adamant, or cheap plywood covered in adamant-colored paint. Or maybe just included the concealed sally-port of an ambiguous implicit exception. A truly solid wall will stop the attack from reaching its objective, regardless of how utterly committed the attacker may be (continuing the terrain metaphor, perhaps sending a fire or flood rather than infantry), but there are construction costs and opportunity costs.
Generally speaking, defense has primacy in social conflict. There’s almost always some way to shut down the communication channel, or just be more stubborn. People open up and negotiate anyway, even when stubbornness could have gotten them everything they wanted without being inconvenienced by the other side’s preferences, because the worst-case costs of losing a social conflict are generally less than the best-case costs of winning a physical conflict. That strategy breaks down in the face of an extremely clever but physically helpless foe, like an ambiguously-motivated AI in a box or Hannibal Lecter in a prison cell, which may be the source of the fascination in both cases.