This post tries to push back against the role of expected utility theory in AI safety by arguing against various ways to derive expected utility axiomatically. I have heard many such arguments before, and IMO they are never especially useful. This post is no exception.
The OP presents the position it argues against as follows (in my paraphrasing): “Sufficiently advanced agents don’t play dominated strategies, therefore, because of [theorem], they have to be expected utility maximizers, therefore they have to be goal-directed and [other conclusions]”. They then proceed to argue that there is no theorem that can make this argument go through.
I think that this entire framing is attacking a weak man. The real argument for expected utility theory is:
In AI safety, we are from the get-go interested in goal-directed systems because (i) we want AIs to achieve goals for us, (ii) we are worried about systems with bad goals, and (iii) stopping systems with bad goals is also a goal.
The next question is then: what is a useful mathematical formalism for studying goal-directed systems?
The theorems quoted in the OP are moderate evidence that expected utility has to be part of this formalism, because their assumptions resonate a lot with our intuitions for what “rational goal-directed behavior” is. Yes, of course we can still quibble with the assumptions (like the OP does in some cases), which is why I say “moderate evidence” rather than “completely watertight proof”, but given how natural the assumptions are, the evidence is good.
More importantly, the theorems are only a small part of the evidence base. A philosophical question is never fully answered by a single theorem. Instead, the evidence base is holistic: looking at the theoretical edifices growing up from expected utility (control theory, learning theory, game theory, etc.), one becomes progressively more convinced that expected utility correctly captures some of the core intuitions behind “goal-directedness”.
If one does want to present a convincing case against expected utility, quibbling with the assumptions of VNM or whatnot is an incredibly weak move. Instead, show us where the entire edifice of existing theory runs aground because of expected utility and how some alternative to expected utility can do better (as an analogy, see how infra-Bayesianism supplants Bayesian decision theory).
In conclusion, there are coherence theorems. But, more important than individual theorems are the “coherence theories”.
Thanks. I agree with your first four bullet points. I disagree that the post is quibbling. Weak man or not, the-coherence-argument-as-I-stated-it was prominent on LW for a long time. And figuring out the truth here matters. If the coherence argument doesn’t work, we can (try to) use incomplete preferences to keep agents shutdownable. As I write elsewhere:
The List of Lethalities mention of ‘Corrigibility is anti-natural to consequentialist reasoning’ points to Corrigibility (2015) and notes that MIRI failed to find a formula for a shutdownable agent. MIRI failed because they only considered agents with complete preferences. Useful agents with complete (and transitive and option-set-independent) preferences will often have some preference regarding the pressing of the shutdown button, as this theorem shows. MIRI thought that they had to assume completeness, because of coherence arguments. But coherence arguments are mistaken: there are no theorems which imply that agents must have complete preferences in order to avoid pursuing dominated strategies. So we can relax the assumption of completeness and use this extra leeway to find a formula for a corrigible consequentialist. That formula is what I purport to give in this post.
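To make the claim about incomplete preferences and dominated strategies concrete, here is a minimal toy sketch. It is an illustration under stated assumptions, not a proof and not necessarily the proposal referred to above. The option names (A, A_minus, B), the function names, and the "remember what you passed up" policy are all hypothetical and introduced only for this example. A_minus is a strictly worse version of A, and B is incomparable to both.

```python
# Toy illustration (all names are hypothetical assumptions, not from the original post).
# Strict preferences as (better, worse) pairs. B is incomparable to A and to A_minus,
# so the preference relation is incomplete.
STRICTLY_PREFERRED = {("A", "A_minus")}


def prefers(x, y):
    """True if x is strictly preferred to y."""
    return (x, y) in STRICTLY_PREFERRED


def run_trades(start, offers, use_memory):
    """Walk through a sequence of trade offers and return the final holding.

    Without memory, the agent accepts any offer that is not strictly worse
    than its current holding. With memory, it additionally declines any offer
    that is strictly worse than an option it previously held or turned down.
    """
    holding = start
    passed_up = []
    for offer in offers:
        worse_than_holding = prefers(holding, offer)
        worse_than_past = use_memory and any(prefers(p, offer) for p in passed_up)
        if worse_than_holding or worse_than_past:
            passed_up.append(offer)    # decline the offer
        else:
            passed_up.append(holding)  # accept: the old holding is given up
            holding = offer
    return holding


# The same two offers, with and without the memory policy.
naive_result = run_trades("A", ["B", "A_minus"], use_memory=False)
memory_result = run_trades("A", ["B", "A_minus"], use_memory=True)
print(naive_result, memory_result)
```

In this sketch the memoryless agent ends up holding A_minus, which is strictly worse than its starting point A, so it has pursued a dominated strategy. The agent that tracks passed-up options stops at B. Both agents have the same incomplete preferences; only the choice policy differs.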
I feel that coherence arguments, broadly construed, are a reason to be skeptical of such proposals, but debating coherence arguments because of this seems backward. Instead, we should just be discussing your proposal directly. Since I haven’t read your proposal yet, I don’t have an opinion, but some coherence-inspired questions I would ask are:
Can you define an incomplete-preferences AIXI consistent with this proposal?
Is there an incomplete-preferences version of RL regret bound theory consistent with this proposal?
What happens when your agent is constructing a new agent? Does the new agent inherit the same incomplete preferences?