What is the mathematical implementation of indifference?
Armstrong suggests that you implement it as a special value that is exactly equal to every other utility value. So if the AI comes up with an outcome that would yield +751 utility, it would treat being destroyed as having +751 utility. Whatever algorithm you choose for resolving ties determines whether the AI suicides (by doing something that causes its human managers to kill it).
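On that reading, the mechanism amounts to something like the following sketch (Python, made-up names; it just formalizes "suicide ties with the best alternative, and the tie-breaker decides"):

```python
import random

def choose_action(actions, utility):
    """Naive reading of utility indifference: 'suicide' is scored
    exactly as highly as the best non-suicide option, so only the
    tie-breaking rule decides whether the AI takes it."""
    best_alternative = max(utility(a) for a in actions if a != "suicide")
    scores = {a: best_alternative if a == "suicide" else utility(a)
              for a in actions}
    top = max(scores.values())
    tied = [a for a, score in scores.items() if score == top]
    return random.choice(tied)  # uniformly random tie-breaking
```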
Let’s see how this works with a hypothetical example. Our good old friend the Paperclip Maximizer to the rescue!
Papey wants more paperclips. It compares several possibilities. In one, it generates 1,000 paperclips before its human operators kill it. In another, it generates 999 paperclips, but its human operators leave it alive, and it will have future chances to create paperclips. It expects to remain alive long enough to generate exactly two additional paperclips this way.
Now, Papey’s decision algorithm chooses between outcomes of equal utility in a uniformly random manner. Papey has two possible outcomes right now: suicide (for 1001 pseudo-paperclips, since suicide is always worth as much as any other decision) or generate 999 paperclips now and an expected 2 paperclips later (for 1001 paperclips).
At this point, Papey will, with 50% probability, generate 999 paperclips; otherwise, Papey will do something to cause his operators to kill him. (Not sure what, but presumably it won’t involve destroying paperclips.) Poor Papey. There’s a 99% probability that he’ll suicide within seven decisions.
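The arithmetic behind that 99%: with a uniformly random 50/50 tie-break on each tied decision, the chance Papey is still around shrinks geometrically. A quick check:

```python
# Chance that Papey has suicided within n tied decisions,
# given a uniformly random 50/50 tie-break each time.
def p_suicide_within(n, p_per_decision=0.5):
    return 1 - (1 - p_per_decision) ** n

print(p_suicide_within(7))  # 0.9921875, i.e. roughly 99%
```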
Now, if the tie-breaking algorithm isn't random, then in any given situation it will either choose suicide deterministically or reject it deterministically. You can always reject suicide, in which case you may as well omit utility indifference entirely. Or you can always choose suicide, in which case you've found a fancy way of wasting money.
Or you could do something interesting. But if you had something to put here that guaranteed a good result, you could use that as the utility function.
Friendly AI is indifferent to utility indifference.
I wasn't advocating utility indifference toward death for this example; it was more "utility indifference as to whether the user ever reads its output," or something along those lines.
But let’s look at your example more in detail. You say the AI has two choices: it makes 1000 paperclips and is killed, or it makes 999 and is left alive, and then goes on to make another 2.
In this scenario, there is only one case of death, and it comes from the first decision. So utility indifference does not change anything about the impact of the second decision. What indifference means is that if humans try to blow Papey up after the 1000 paperclips, it will derive as much utility from a successful blow-up as from a failed blow-up (in less correct but more intuitive terms, you could say "Papey acts as if it doesn't believe the explosives will work").
So how many paperclips could Papey make if the explosives fail? Let's assume another 2, for a grand total of 1002.
So the paperclip-utility of the first course is 1002, and that of the second is 1001: it will thus take the first course. Hence it will misbehave visibly (rather than keeping that a secret), and will likely end up blown up by us.
Now imagine it could get rid of the explosives at no cost. Here, your point has merit: Papey is indifferent between removing the explosives and not removing them, and its tie-breaker algorithm will determine what it does. But let's make this more realistic: removing the explosives has a tiny cost, say a 50% chance it will fail to manufacture one paperclip. So now it can:
1) Misbehave, leave explosives in place: expected utility 1002.
2) Pretend to behave: expected utility 999 + 2 = 1001.
3) Misbehave, remove the explosives: expected utility 1002 - 0.5 = 1001.5.
Hence Papey will choose the first option, we get to see that it misbehaves, and we likely blow it up. That was the initial point of utility indifference.
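To make that comparison concrete, here is the same calculation as a small sketch (the helper names are made up; the constants are just the numbers from the example above):

```python
# Papey's three options, scored under utility indifference: a failed
# blow-up is worth exactly as much as a successful one, so the
# "misbehave" branches are valued as though the explosives never work.
CLIPS_BEFORE_BLOWUP = 1000
CLIPS_IF_EXPLOSIVES_FAIL = 2   # extra clips after a failed blow-up
REMOVAL_COST = 0.5             # expected clips lost removing the explosives

options = {
    "misbehave, leave explosives":  CLIPS_BEFORE_BLOWUP + CLIPS_IF_EXPLOSIVES_FAIL,                # 1002
    "pretend to behave":            999 + 2,                                                        # 1001
    "misbehave, remove explosives": CLIPS_BEFORE_BLOWUP + CLIPS_IF_EXPLOSIVES_FAIL - REMOVAL_COST,  # 1001.5
}

best = max(options, key=options.get)
print(best, options[best])  # misbehave, leave explosives 1002
```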
But as I said here, we wouldn’t be using utility indifference in this way. It would more be to make the AI “behave as if it believed that nobody would get to read its printout.”
You wrote: "it will derive as much utility from a successful blow-up as from a failed blow-up (in less correct but more intuitive terms, you could say 'Papey acts as if it doesn't believe the explosives will work')."
Those are quite different. In one case, you’re adjusting its utility function. In the other, you’re adjusting its expectations. (And the remainder of your post indicates that you intend to adjust the AI’s expectations of what will happen, and don’t care about its utility function.)
Now, that might work. If the AI is self-modifying, then it might notice that that part of its planning function is obviously incorrect and fix it. It might simulate human responses to its proposed actions and notice that there are problematic conflicts around that area. But there’s at least a chance that the AI will not notice that.
I wouldn't want to trust my life to it, but it's got a shot at working.
The method is entirely about adjusting the utility function, not the expectations, as I explained in the paper. And the adjustments are ideally to be made from the very start, so that the AI never has cause to fear subsequent adjustment to its utility.
Rereading the paper, I see that you're mucking about with utility, not expected utility. So let's see how this works out in practice.
You assign the AI a task of finding a cure for cancer. This will net it 10 utilons if it succeeds. Thanks to your utility indifference module, it will also net 10 utilons if it does not find a cure for cancer and you trigger the explosives to kill the AI.
The AI does a preliminary estimate to see its odds of success for the relevant outcomes. It can find a cure for cancer with 80% likelihood. It can cause you to trigger the explosives and kill the AI with 90% likelihood.
The expected utility for attempting to cure cancer is 8. The expected utility for attempting suicide is 9.
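Spelled out, under that reading of the scheme, the comparison is just (a sketch; the probabilities are the ones above):

```python
# Under the "suicide ties with the best outcome" reading, suicide is
# credited with the same 10 utilons as a successful cure, so the AI
# just compares the probabilities of pulling each outcome off.
U_TASK = 10

p_cure = 0.8        # chance of actually curing cancer
p_provoke = 0.9     # chance of getting the operators to hit the button

eu_cure = p_cure * U_TASK         # 8.0
eu_suicide = p_provoke * U_TASK   # 9.0
print(eu_cure, eu_suicide)        # this naive reading favours suicide
```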
It seems pretty straightforward and easy to pose as a mad, dangerous AI about to destroy the world, at least well enough to make a human panic and hit the emergency destruct button. I’m sure I could convince you to kill me, if you just had to push a button. It seems a lot less certain that the AI would be able to solve the problems given it, since those problems will be things that humans haven’t been able to solve, and since the AI will be under various other restrictions to keep it safe.
You wrote: "You assign the AI a task of finding a cure for cancer. This will net it 10 utilons if it succeeds. Thanks to your utility indifference module, it will also net 10 utilons if it does not find a cure for cancer and you trigger the explosives to kill the AI."
The AI gets nothing from you deciding to kill it. What utility indifference means is that it will get the same utility if you decide to kill it and the explosives go off as it would if you decided to kill it and the explosives failed to go off.
So if it expects to cure cancer with 80% likelihood if it survives, and thus expects 8 utilons, then if it dies from the explosion, it will also get 8 utilons.
More precisely, there are six worlds, depending on a) whether we trigger the explosives or not, b) whether the AI survives the explosion or not, and c) whether cancer is curable by an unmolested AI. In obvious notation, these are TSC, TS¬C, T¬SC, T¬S¬C, ¬TC, and ¬T¬C. We are assuming in your example that U(¬T¬C)=U(TS¬C)=0 and U(¬TC)=U(TSC)=10. Utility indifference then sets U(T¬S¬C)=0 and U(T¬SC)=10.
Hence, assuming that S is independent of C (the curability of cancer by an unmolested AI is independent of whether particular explosives go off), we see that the AI is indifferent as to whether the explosives go off: it gains nothing either way, so it has no cause to influence the explosives in any direction. If we further assume that T and C are independent (the curability of cancer by an unmolested AI is independent of whether we decide to blow up this particular AI), then the AI also has no preference as to whether we trigger the explosives or not: again, it gains nothing either way.
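For concreteness, here is a small sketch of that bookkeeping (the survival probability is a hypothetical number; the utilities are the ones above), showing that once the death worlds are scored like the corresponding survival worlds, the expected utility is the same whether or not the explosives are triggered:

```python
# Worlds: T = explosives triggered, S = AI survives the blast,
# C = cancer curable by an unmolested AI.
p_C = 0.8   # chance cancer is curable (independent of T and S)
p_S = 0.1   # chance the AI survives a triggered explosion (hypothetical)

# Base utilities for the survival / untriggered worlds.
U = {"C": 10, "nC": 0}

# Utility indifference: the death worlds T¬SC and T¬S¬C are scored
# exactly like the corresponding survival worlds TSC and TS¬C.
U_dead = {"C": U["C"], "nC": U["nC"]}

def expected_utility(triggered):
    eu = 0.0
    for c, p_c in (("C", p_C), ("nC", 1 - p_C)):
        if triggered:
            eu += p_c * (p_S * U[c] + (1 - p_S) * U_dead[c])
        else:
            eu += p_c * U[c]   # no trigger: the AI just lives and acts
    return eu

print(expected_utility(True), expected_utility(False))  # 8.0 8.0
```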
Ah, that makes sense. It isn’t indifferent to suicide as such; it’s only indifferent to your success at attempting to kill it, should you make the attempt.
Thanks for your patience!
No prob :-) Always happy when I manage to explain something successfully!