I’m coming to this two weeks late, but here are my thoughts.
The question of interest is:
Will sufficiently-advanced artificial agents be representable as maximizing expected utility?
Rephrased:
Will sufficiently-advanced artificial agents satisfy the VNM axioms (Completeness, Transitivity, Independence, and Continuity)?
Coherence arguments purport to establish that the answer is yes. These arguments go like this:
1. There exist theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.
2. Sufficiently-advanced artificial agents will not pursue dominated strategies.
3. So, sufficiently-advanced artificial agents will be representable as maximizing expected utility.
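To keep the key term concrete: an agent is representable as maximizing expected utility just in case there is some assignment of utilities to outcomes such that the agent’s strict preferences always track strictly higher expected utility. Here is a minimal illustrative sketch of that idea; the outcomes, lotteries, preferences, and crude grid search below are my own toy example, not anything from the post:

```python
import itertools

# Three outcomes and a few illustrative lotteries (probability distributions
# over outcomes). The specific numbers are made up for illustration.
OUTCOMES = ["apple", "banana", "cherry"]
LOTTERIES = {
    "L1": {"apple": 1.0},
    "L2": {"banana": 0.5, "cherry": 0.5},
    "L3": {"apple": 0.25, "banana": 0.75},
}

# A hypothetical agent's strict preferences over lotteries, as (better, worse) pairs.
PREFERENCES = [("L1", "L2"), ("L2", "L3"), ("L1", "L3")]

def expected_utility(lottery, u):
    return sum(p * u[outcome] for outcome, p in lottery.items())

def find_eu_representation(preferences, grid=range(11)):
    """Crude search for a utility function (on a coarse grid) under which every
    stated strict preference corresponds to strictly higher expected utility."""
    for values in itertools.product(grid, repeat=len(OUTCOMES)):
        u = dict(zip(OUTCOMES, values))
        if all(expected_utility(LOTTERIES[a], u) > expected_utility(LOTTERIES[b], u)
               for a, b in preferences):
            return u
    return None  # no expected-utility representation found on this grid

print(find_eu_representation(PREFERENCES))
```

The search is deliberately crude (a coarse grid, only one direction of the biconditional); it’s just meant to pin down what the question of interest is asking.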
These arguments don’t work, because premise 1 is false: there are no theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies. In the year since I published my post, no one has disputed that.
Now to address two prominent responses:
‘I define ‘coherence theorems’ differently.’
In the post, I used the term ‘coherence theorems’ to refer to ‘theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ I took that to be the usual definition on LessWrong (see the Appendix for why), but some people replied that they meant something different by ‘coherence theorems’: e.g. ‘theorems that are relevant to the question of agent coherence.’
All well and good. If you use that definition, then there are coherence theorems. But if you use that definition, then coherence theorems can’t play the role that they’re supposed to play in coherence arguments. Premise 1 of the coherence argument is still false. That’s the important point.
‘The mistake is benign.’
This is a crude summary of Rohin’s response. Rohin and I agree that the Complete Class Theorem implies the following: ‘If an agent has complete and transitive preferences, then unless the agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ So the mistake is neglecting to say ‘If an agent has complete and transitive preferences…’ Rohin thinks this mistake is benign.
I don’t think the mistake is benign. As my rephrasing of the question of interest above makes clear, Completeness and Transitivity are a major part of what coherence arguments aim to establish! So it’s crucial to note that the Complete Class Theorem gives us no reason to think that sufficiently-advanced artificial agents will have complete or transitive preferences, especially since:
Completeness doesn’t come for free.
Money-pump arguments for Completeness (applied to artificial agents) aren’t convincing.
Money-pump arguments for Transitivity assume Completeness.
Training agents to violate Completeness might keep them shutdownable.
Two important points
Here are two important points, which I make to preclude misreadings of the post:
Future artificial agents—trained in a standard way—might still be representable as maximizing expected utility.
Coherence arguments don’t work, but there might well be other reasons to think that future artificial agents—trained in a standard way—will be representable as maximizing expected utility.
Artificial agents not representable as maximizing expected utility can still be dangerous.
So why does the post matter?
The post matters because ‘train artificial agents to have incomplete preferences’ looks promising as a way of ensuring that these agents allow us to shut them down.
AI safety researchers haven’t previously considered incomplete preferences as a solution, plausibly because these researchers accepted coherence arguments and so thought that agents with incomplete preferences were a non-starter.[1] But coherence arguments don’t work, so training agents to have incomplete preferences is back on the table as a strategy for reducing risks from AI. And (I think) it looks like a pretty good strategy. I make the case for it in this post, and my coauthors and I will soon be posting some experimental results suggesting that the strategy is promising.
[1] As I wrote elsewhere:
The List of Lethalities mention of ‘Corrigibility is anti-natural to consequentialist reasoning’ points to Corrigibility (2015) and notes that MIRI failed to find a formula for a shutdownable agent. MIRI failed because they only considered agents with complete preferences. Useful agents with complete (and transitive and option-set-independent) preferences will often have some preference regarding the pressing of the shutdown button, as this theorem shows. MIRI thought that they had to assume completeness, because of coherence arguments. But coherence arguments are mistaken: there are no theorems which imply that agents must have complete preferences in order to avoid pursuing dominated strategies. So we can relax the assumption of completeness and use this extra leeway to find a formula for a corrigible consequentialist. That formula is what I purport to give in this post.
I find the money pump argument for completeness to be convincing.
The rule that you provide as a counterexample (the Caprice rule) is one that gradually completes the agent’s preferences as it encounters a variety of decisions. You appear to agree that this is the case. This isn’t a large problem for your argument on its own. The big problem is that when there are lots of random nodes in the decision tree, such that the agent might encounter a wide variety of potentially money-pumping trades, the agent needs to complete its preferences in advance, or risk its strategy being dominated.
You argue with John about this here, and John appears to have dropped the argument. It looks to me like your argument there is wrong, at least when it comes to situations where there are sufficient assumptions to talk about coherence (which is when the preferences are over final outcomes, rather than trajectories).
I take the ‘lots of random nodes’ possibility to be addressed by this point:
And this point generalises to arbitrarily complex/realistic decision trees, with more choice-nodes, more chance-nodes, and more options. Agents with a model of future trades can use their model to predict what they’d do conditional on reaching each possible choice-node, and then use those predictions to determine the nature of the options available to them at earlier choice-nodes. The agent’s model might be defective in various ways (e.g. by getting some probabilities wrong, or by failing to predict that some sequences of trades will be available) but that won’t spur the agent to change its preferences, because the dilemma from my previous comment recurs: if the agent is aware that some lottery is available, it won’t choose any dispreferred lottery; if the agent is unaware that some lottery is available and chooses a dispreferred lottery, the agent’s lack of awareness means it won’t be spurred by this fact to change its preferences. To get over this dilemma, you still need the ‘non-myopic optimiser deciding the preferences of a myopic agent’ setting, and my previous points apply: results from that setting don’t vindicate coherence arguments, and we humans as non-myopic optimisers could decide to create artificial agents with incomplete preferences.
Can you explain why you think that doesn’t work?
To elaborate a little more, introducing random nodes allows for the possibility that the agent ends up with some outcome that it disprefers to the outcome that it would (as a matter of fact, unbeknownst to the agent) have gotten by making different choices. But that’s equally true of agents with complete preferences.
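To make the deterministic core of this concrete, here is a minimal sketch of the single-souring money pump with a Caprice-style rule applied to whole plans rather than to individual trades. The encoding below (the outcome labels, the two-offer trade sequence, and the permissibility filter) is my own toy illustration, not the formal statement from the post; chance nodes would require comparing plan prospects rather than sure outcomes, but the same filtering idea applies:

```python
from itertools import product

# Outcomes: 'A', 'A-' (A made slightly worse), and 'B'.
# Preferences are incomplete: A is strictly preferred to A-, while B is
# incomparable to both. These labels are illustrative only.
STRICT_PREFS = {("A", "A-")}  # (better, worse) pairs; all other pairs incomparable

def strictly_preferred(x, y):
    return (x, y) in STRICT_PREFS

def outcome(plan):
    """Deterministic two-offer trade sequence: first an offer to swap A for B,
    then (only if the first offer is accepted) an offer to swap B for A-."""
    accept_first, accept_second = plan
    if not accept_first:
        return "A"          # refuse the first trade and keep A
    return "A-" if accept_second else "B"

plans = list(product([True, False], repeat=2))
outcomes = {plan: outcome(plan) for plan in plans}

# Caprice-style choice over whole plans: a plan is permissible unless some other
# available plan leads to an outcome strictly preferred to this plan's outcome.
permissible = [
    plan for plan in plans
    if not any(strictly_preferred(outcomes[other], outcomes[plan]) for other in plans)
]

print({plan: outcomes[plan] for plan in permissible})
# Plans that refuse the first trade end with A, the plan that stops at B is also
# permissible, and every plan ending with the soured option A- is ruled out.
```

The plan that ends with the soured option is never permissible, because another available plan leads to a strictly preferred outcome; the agent’s preferences stay incomplete (B remains incomparable to both A and A-) and yet no dominated strategy gets chosen.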
I intended for my link to point to the comment you linked to, oops.
I’ve responded here; I think it’s better to just keep one thread of argument, in a place where there is more of the necessary context.