Nice! What about conditions that break the symmetry between N and S, though?
Suppose there are two actions A and B, and “on switch” o. Maybe we only want the AI to care about what happens when the on switch is on, and not what happens when the switch is off.
So we replace the “pure bets” condition with the “switched bets” condition: if P(o|A) = P(o|B) and E(U|A,o) > E(U|B,o), take action A.
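To make that concrete, here’s a minimal sketch in Python (my own toy formalization, not anything from the proof; an action is summarized by just the two quantities the condition looks at):

```python
from dataclasses import dataclass

@dataclass
class Action:
    p_on: float   # P(o | action): probability the on switch ends up on
    eu_on: float  # E(U | action, o): expected utility given the switch is on

def switched_bets_prefers(a: Action, b: Action) -> bool:
    """True iff the switched-bets condition forces taking `a` over `b`."""
    return a.p_on == b.p_on and a.eu_on > b.eu_on

# Both actions leave P(o) at 0.5, but the first pays better when o holds,
# so the condition mandates it -- whatever happens when the switch is off.
print(switched_bets_prefers(Action(0.5, 10.0), Action(0.5, 3.0)))  # True
```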
Now the example with Northland and Southland doesn’t go through in the same way, because we have to pick one of the countries to asymmetrically be the one where things matter if it wins. This leads the AI to send a bet that the chosen country will win to that country (hurting its chances, but the AI doesn’t switch to betting on the opposite country, because that wouldn’t improve its payoff in the case where the chosen country wins).
If we implement your example, the AI is willing to bet at arbitrarily poor odds that the on switch will be on, thus violating the sane pure bets condition.
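To spell that out with made-up numbers (a toy calculation of mine, not part of the proof): a pure bet on o leaves P(o) untouched, so the switched-bets rule never weighs the downside at all:

```python
# Pure bet: win one cent if the switch is on, lose `stake` if it's off.
# Betting doesn't change P(o), so the switched-bets rule compares only the
# payoffs conditional on o -- and one cent beats nothing at any stake.
p_on, win = 0.5, 0.01
for stake in (100.0, 1e6, 1e12):
    assert win > 0.0  # E(U | bet, o) > E(U | pass, o): the rule says bet
    ev = p_on * win - (1 - p_on) * stake  # unconditional EV, which it ignores
    print(f"stake {stake:>16,.0f}: expected value {ev:,.2f}")
# The expected loss grows without bound as the stake grows: insane pure bets.
```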
You can have particular decision problems or action spaces that don’t have the circular property of the Northland-Southland problem, but the fact remains that if an AI reliably fulfills the weak indifference condition, it must violate the sane pure bets condition in some circumstances. There must be insane bets that it’s willing to take, even if no such bets are available in a particular situation.
Basically, the proof isn’t about an AI in a particular scenario; it’s about conditions that no AI can fulfill in all scenarios at once.
I could construct a trivial decision problem where the AI has only one action available, and then the sane pure bets condition and the weak indifference condition are both vacuous for that decision problem. But when we place the same AI in different scenarios, there must exist some scenarios where it violates at least one of the conditions.
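For instance, in the same toy model (invented numbers again; the little `choose` function below assumes every offered action has the same P(o), so the switched-bets rule reduces to maximizing E(U|o)):

```python
# One rule, two scenarios. With a single action, neither condition binds;
# add a pure bet at terrible odds, and the rule must take it.
def choose(actions):
    # All actions here share the same P(o), so the switched-bets condition
    # reduces to: take the action with the highest E(U | o).
    return max(actions, key=lambda a: a["eu_on"])

wait = {"name": "wait", "p_on": 0.5, "eu_on": 0.0}
bet  = {"name": "bet $1M to win one cent", "p_on": 0.5, "eu_on": 0.01}

print(choose([wait])["name"])       # "wait": both conditions are vacuous here
print(choose([wait, bet])["name"])  # the insane bet, as soon as it's offered
```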
If we implement your example, the AI is willing to bet at arbitrarily poor odds that the on switch will be on, thus violating the sane pure bets condition.
Yes. But the symmetry of the sane pure bets condition doesn’t quite match what we want from corrigibility anyhow. I don’t want an AI with a shutdown button to be making contingency plans to ensure good outcomes for itself even when the shutdown button is pressed.
Yes, the point of the proof isn’t that the sane pure bets condition and the weak indifference condition are the be-all and end-all of corrigibility. But using the proof’s result, I can notice that your AI will be happy to bet a million dollars against one cent that the shutdown button won’t be pressed, which doesn’t seem desirable. It’s effectively willing to burn arbitrary amounts of utility, if we present it with the right bets.
Ideally, a successful solution to the shutdown problem should violate one or both of these conditions in clear, limited ways which don’t result in unsafe behavior, or which result in suboptimal behavior whose suboptimality falls within well-defined bounds. Rather than guessing-and-checking potential solutions and being surprised when they fail to satisfy both conditions, we should look specifically for non-sane-pure-bettors and non-intuitively-indifferent agents which nevertheless behave corrigibly and desirably.