Yes, the point of the proof isn’t that the sane pure bets condition and the weak indifference condition are the be-all and end-all of corrigibility. But using the proof’s result, I can point out that your AI will happily bet a million dollars against one cent that the shutdown button won’t be pressed, which doesn’t seem desirable. It’s effectively willing to burn arbitrary amounts of utility if we present it with the right bets.
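To see why, here’s a minimal sketch, assuming an Armstrong-style utility-indifference setup with a compensation term (the post’s exact conditions may differ in detail). Suppose the agent’s utility is corrected so that, whatever it does, its utility conditional on the button being pressed is topped up to equal its expected utility conditional on the button not being pressed. Now offer it a bet that pays $\epsilon$ if the button isn’t pressed and costs $L$ if it is. The change in its expected utility is

$$\Delta \mathbb{E}[U] = P(\neg\text{press})\cdot\epsilon + P(\text{press})\cdot(-L + L) = P(\neg\text{press})\cdot\epsilon > 0,$$

since the compensation term exactly cancels the $-L$ loss in the press branch. This is positive for any $\epsilon > 0$ and any $L$ whatsoever, e.g. $\epsilon = \$0.01$ and $L = \$10^6$: the million-against-a-cent bet.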
Ideally, a successful solution to the shutdown problem should violate one or both of these conditions in clear, limited ways which don’t result in unsafe behavior, or which result in suboptimal behavior whose suboptimality falls within well-defined bounds. Rather than guessing and checking potential solutions and being surprised when they turn out to violate one of the conditions, we should look specifically for non-sane-pure-betters and non-intuitively-indifferent agents which nevertheless behave corrigibly and desirably.