There’s another argument I think you might have missed:
Utilitarianism is about being optimal. Instinctive morality is about being failsafe.
Implicit in all decisions is a nonzero possibility that you are wrong. Once you take that into account, having some “hard” rules like not agreeing to torture here (or in other dilemmas), not pushing the fat guy onto the tracks in the trolley problem, etc., can save you from making horrible mistakes at the cost of slightly suboptimal decisions. Which is, incidentally, how I would want a friendly AI to decide as well: losing a bit in the average case to prevent a really horrible worst case.
That rule alone would, of course, make you vulnerable to Pascal’s Mugging. I think the way to go here is to have some threshold at which you round very low (or very high) probabilities off to zero (or one) when the difference is small compared to the probability of you being wrong. Not only will this protect you against getting your decisions hacked, it will also stop you from wasting computing power on improbable outcomes. This seems to be the reason why Pascal’s Mugging usually fails on humans.
Both of these are necessary patches because we operate on opaque, faulty and potentially hostile hardware. One without the other is vulnerable to hacks and catastrophic failure modes, but both taken together are a pretty strong base for decisions that, so far, have served us humans pretty well. In two rules:
1) Ignore outcomes to which you assign a lower probability than to you being wrong/mistaken about the situation.
2) Ignore decisions with horrible worst case scenarios if there are options with a less horrible worst case and a still acceptable average case.
When both of these apply to the same thing, or this process eliminates all options, you have a dilemma. Try to reduce your uncertainty about 1) and start looking for other options in 2). If that is impossible, shut up and do it anyway.
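To make the two rules concrete, here is a rough Python sketch of how they could be combined. Everything in it is my own assumption for illustration, not a worked-out decision theory: the names (Outcome, Option, choose), the p_wrong and acceptable_average parameters, the choice not to renormalize probabilities after pruning, and reading rule 2 as "among options with an acceptable average, take the best worst case". A dilemma is simply reported as None.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Outcome:
    probability: float  # your estimate that this outcome happens
    utility: float      # how good (positive) or bad (negative) it is


@dataclass
class Option:
    name: str
    outcomes: List[Outcome]


def prune(option: Option, p_wrong: float) -> Option:
    """Rule 1: ignore outcomes less probable than you being wrong."""
    kept = [o for o in option.outcomes if o.probability >= p_wrong]
    return Option(option.name, kept)


def average_case(option: Option) -> float:
    # Expected utility over the surviving outcomes (not renormalized).
    return sum(o.probability * o.utility for o in option.outcomes)


def worst_case(option: Option) -> float:
    return min(o.utility for o in option.outcomes)


def choose(options: List[Option], p_wrong: float,
           acceptable_average: float) -> Optional[str]:
    """Apply rule 1, then rule 2. Returns None when nothing survives."""
    pruned = [prune(o, p_wrong) for o in options]
    pruned = [o for o in pruned if o.outcomes]
    acceptable = [o for o in pruned if average_case(o) >= acceptable_average]
    if not acceptable:
        return None  # dilemma: reduce uncertainty or look for more options
    # Rule 2: prefer the least horrible worst case, then the better average.
    best = max(acceptable, key=lambda o: (worst_case(o), average_case(o)))
    return best.name


# Pascal's Mugging: a tiny chance of a huge payoff gets rounded off to zero.
mugging = [
    Option("pay", [Outcome(0.999999999, -5.0), Outcome(1e-9, 1e15)]),
    Option("refuse", [Outcome(1.0, 0.0)]),
]

# A gamble with a better average but a much worse worst case.
gamble = [
    Option("risky", [Outcome(0.9, 10.0), Outcome(0.1, -50.0)]),
    Option("safe", [Outcome(1.0, 3.0)]),
]

print(choose(mugging, p_wrong=0.01, acceptable_average=-1.0))
print(choose(gamble, p_wrong=0.01, acceptable_average=2.0))
```

With these made-up numbers, a plain expected-utility maximizer would pay the mugger and take the risky gamble, while the two rules together pick the boring options. The sketch deliberately stops at reporting a dilemma; what to do after that is the "shut up and do it anyway" part.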