I’m coming to this two weeks late, but here are my thoughts.
The question of interest is:
Will sufficiently-advanced artificial agents be representable as maximizing expected utility?
Rephrased:
Will sufficiently-advanced artificial agents satisfy the VNM axioms (Completeness, Transitivity, Independence, and Continuity)?
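For reference, here is one common textbook statement of the four axioms, where ⪰ is the agent's preference relation over lotteries and pA + (1−p)C is the lottery that yields A with probability p and C otherwise (my formulation, not drawn from the post):

```latex
% One standard formulation of the VNM axioms (a textbook statement, not the post's).
\begin{align*}
\textbf{Completeness:} \quad & \forall A, B:\; A \succeq B \;\lor\; B \succeq A \\
\textbf{Transitivity:} \quad & \forall A, B, C:\; (A \succeq B \;\land\; B \succeq C) \Rightarrow A \succeq C \\
\textbf{Independence:} \quad & \forall A, B, C,\; \forall p \in (0,1]:\; A \succeq B \;\Leftrightarrow\; pA + (1-p)C \succeq pB + (1-p)C \\
\textbf{Continuity:} \quad & \forall A, B, C:\; A \succeq B \succeq C \Rightarrow \exists p \in [0,1]:\; pA + (1-p)C \sim B
\end{align*}
```

The VNM theorem then says that an agent satisfies all four axioms if and only if there is some utility function u such that the agent prefers one lottery to another exactly when the first has higher expected u.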
Coherence arguments purport to establish that the answer is yes. These arguments go like this:
1. There exist theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.
2. Sufficiently-advanced artificial agents will not pursue dominated strategies.
3. So, sufficiently-advanced artificial agents will be representable as maximizing expected utility.
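To make "pursue dominated strategies" concrete, here is a minimal sketch (my toy example, not from the post): an agent with cyclic strict preferences that always pays a small fee to trade up to a strictly preferred option ends up holding its original option and is poorer by the fees, so the always-trade policy is dominated by simply keeping the original option.

```python
FEE = 0.01

# Cyclic strict preferences: (X, Y) means the agent strictly prefers X to Y.
strictly_prefers = {("A", "B"), ("B", "C"), ("C", "A")}

def accepts_trade(current, offered):
    """Toy choice rule: trade whenever the offered option is strictly preferred."""
    return (offered, current) in strictly_prefers

holding, money = "A", 0.0
for offer in ["C", "B", "A"]:       # a cycle of pairwise offers, each for a small fee
    if accepts_trade(holding, offer):
        holding, money = offer, money - FEE

print(holding, round(money, 2))     # prints: A -0.03  (same option, less money)
```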
These arguments don’t work, because premise 1 is false: there are no theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies. In the year since I published my post, no one has disputed that.
Now to address two prominent responses:
‘I define ‘coherence theorems’ differently.’
In the post, I used the term ‘coherence theorems’ to refer to ‘theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ I took that to be the usual definition on LessWrong (see the Appendix for why), but some people replied that they meant something different by ‘coherence theorems’: e.g. ‘theorems that are relevant to the question of agent coherence.’
All well and good. If you use that definition, then there are coherence theorems. But if you use that definition, then coherence theorems can’t play the role that they’re supposed to play in coherence arguments. Premise 1 of the coherence argument is still false. That’s the important point.
‘The mistake is benign.’
This is a crude summary of Rohin’s response. Rohin and I agree that the Complete Class Theorem implies the following: ‘If an agent has complete and transitive preferences, then unless the agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ So the mistake is neglecting to say ‘If an agent has complete and transitive preferences…’ Rohin thinks this mistake is benign.
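Schematically, the difference can be put like this (my paraphrase of the point above, not a formal statement of the theorem):

```latex
% What the Complete Class Theorem supports, per the discussion above:
(\text{Completeness} \land \text{Transitivity}) \;\Rightarrow\;
  \bigl(\neg\,\text{EU-representable} \Rightarrow \text{liable to pursue dominated strategies}\bigr)

% What premise 1 requires:
\neg\,\text{EU-representable} \;\Rightarrow\; \text{liable to pursue dominated strategies}
```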
I don’t think the mistake is benign. As my rephrasing of the question of interest above makes clear, Completeness and Transitivity are a major part of what coherence arguments aim to establish! So it’s crucial to note that the Complete Class Theorem gives us no reason to think that sufficiently-advanced artificial agents will have complete or transitive preferences, especially since:
Completeness doesn’t come for free.
Money-pump arguments for Completeness (applied to artificial agents) aren’t convincing (see the sketch after this list).
Money-pump arguments for Transitivity assume Completeness.
Training agents to violate Completeness might keep them shutdownable.
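Here is a minimal sketch of the kind of point at issue in the second item above (my toy example, not the post's argument): an agent with incomplete preferences that simply declines to trade at preferential gaps is not thereby led into a dominated outcome by a sequence of pairwise offers.

```python
# Toy sketch (my illustration, not the post's argument).
# Options: "A+" is strictly preferred to "A"; "B" is incomparable to both
# (a preferential gap, not indifference).

strictly_prefers = {("A+", "A")}

def accepts_trade(current, offered):
    """Toy choice rule: trade only when the offered option is strictly preferred."""
    return (offered, current) in strictly_prefers

# Attempted pump: offer to swap A+ for B (a gap), then to swap for A.
# An agent that traded at every gap would end with A, which it disprefers
# to the A+ it started with.
holding = "A+"
for offer in ["B", "A"]:
    if accepts_trade(holding, offer):
        holding = offer

print(holding)  # prints: A+  (the agent declines to trade at the gap, so it is not pumped)
```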
Two important points
Here are two important points, which I make to preclude misreadings of the post:
Future artificial agents—trained in a standard way—might still be representable as maximizing expected utility.
Coherence arguments don’t work, but there might well be other reasons to think that future artificial agents—trained in a standard way—will be representable as maximizing expected utility.
Artificial agents not representable as maximizing expected utility can still be dangerous.
So why does the post matter?
The post matters because ‘train artificial agents to have incomplete preferences’ looks promising as a way of ensuring that these agents allow us to shut them down.
AI safety researchers haven’t previously considered incomplete preferences as a solution, plausibly because these researchers accepted coherence arguments and so thought that agents with incomplete preferences were a non-starter.[1] But coherence arguments don’t work, so training agents to have incomplete preferences is back on the table as a strategy for reducing risks from AI. And (I think) it looks like a pretty good strategy. I make the case for it in this post, and my coauthors and I will soon be posting some experimental results suggesting that the strategy is promising.
As I wrote elsewhere: