EJT comments on 4. Existing Writing on Corrigibility

EJT 2 Jul 2024 15:57 UTC
LW: 5 AF: 4
I think your ‘Incomplete preferences’ section makes various small mistakes that add up to important misunderstandings.
> The utility maximization concept largely comes from the VNM-utility-theorem: that any policy (i.e. function from states to actions) which expresses a complete set of transitive preferences (which aren’t sensitive to unused alternatives) over lotteries is able to be described as an agent which is maximizing the expectation of some real-valued utility function over outcomes.
I think you intend ‘sensitive to unused alternatives’ to refer to the Independence axiom of the VNM theorem, but VNM Independence isn’t about unused alternatives. It’s about lotteries that share a sublottery. It’s Option-Set Independence (sometimes called ‘Independence of Irrelevant Alternatives’) that’s about unused alternatives.
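To make the contrast concrete, here is a rough formal statement of the two conditions. The notation is an assumption introduced for illustration: $\succsim$ is weak preference over lotteries, $pX + (1-p)Z$ is the lottery that yields $X$ with probability $p$ and $Z$ otherwise, and $\succsim_{S}$ is the preference the agent displays when its option set is $S$. The second statement is only one possible way to formalize Option-Set Independence.

```latex
% VNM Independence: mixing both sides with a shared sublottery Z leaves the preference unchanged.
\forall X, Y, Z,\ \forall p \in (0, 1]: \quad
X \succsim Y \iff pX + (1-p)Z \succsim pY + (1-p)Z

% Option-Set Independence (one possible formalization):
% the preference between X and Y does not vary with the other options on offer.
\forall S, S' \supseteq \{X, Y\}: \quad
X \succsim_{S} Y \iff X \succsim_{S'} Y
```

The first condition constrains preferences between lotteries that share a sublottery $Z$; the second constrains how preferences respond to the other options in the set.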
> On the surface, the axioms of VNM-utility seem reasonable to me
To me too! But the question isn’t whether they seem reasonable. It’s whether we can train agents that enduringly violate them. I think that we can. Coherence arguments give us little reason to think that we can’t.
> unused alternatives seem basically irrelevant to choosing between superior options
Yes, but this isn’t Independence. And the question isn’t about what seems basically irrelevant to us.
> agents with intransitive preferences can be straightforwardly money-pumped
Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.
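A minimal sketch of the difference, under assumptions that are not part of the original discussion: preferences are stored as a set of `(better, worse)` pairs, and the hypothetical `run_pump` helper models an agent that pays a fee to swap only when it strictly prefers the offered item to the one it holds.

```python
def run_pump(strict_prefs, start, offers, fee=1):
    """Offer items one at a time; the agent pays `fee` to swap
    whenever it strictly prefers the offered item to its holding."""
    holding, paid = start, 0
    for offered in offers:
        if (offered, holding) in strict_prefs:
            holding, paid = offered, paid + fee
    return holding, paid


# Cyclic strict preferences: A > B, B > C, C > A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}

# Intransitive but acyclic and incomplete: A > B, B > C,
# and no preference either way between A and C.
gappy = {("A", "B"), ("B", "C")}

offers = ["C", "B", "A"]
cyclic_result = run_pump(cyclic, "A", offers)  # ends holding A again, having paid three fees
gappy_result = run_pump(gappy, "A", offers)    # refuses the first swap, so the pump never starts
```

In this toy model the cyclic agent ends up where it started but poorer, while the second agent cannot be moved from A to C without some further assumption (such as Completeness) that makes it willing to accept that swap.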
> as long as the resources are being modeled as part of what the agent has preferences about
Yes, but the concern is whether we can instil such preferences. It seems like it might be hard to train agents to prefer to spend resources in pursuit of their goals except in cases where they would do so by resisting shutdown.
> Thornley, I believe, thinks he’s proposing a non-VNM rational agent. I suspect that this is a mistake on his part that stems from neglecting to formulate the outcomes as capturing everything that he wants.
You can, of course, always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That’s not a problem for my proposal. See:
> Thanks, Lucius. Whether or not decision theory as a whole is concerned only with external behaviour, coherence arguments certainly aren’t. Remember what the conclusion of these arguments is supposed to be: advanced agents who start off not being representable as EUMs will amend their behaviour so that they are representable as EUMs, because otherwise they’re liable to pursue dominated strategies.
> Now consider an advanced agent who appears not to be representable as an EUM: it’s paying to trade vanilla for strawberry, strawberry for chocolate, and chocolate for vanilla. Is this agent pursuing a dominated strategy? Will it amend its behaviour? It depends on the objects of preference. If objects of preference are ice-cream flavours, the answer is yes. If the objects of preference are sequences of trades, the answer is no. So we have to say something about the objects of preference in order to predict the agent’s behaviour. And the whole point of coherence arguments is to predict agents’ behaviour.
> And once we say something about the objects of preference, then we can observe agents violating Completeness and acting in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ This doesn’t require looking into the agent or saying anything about its algorithm or anything like that. It just requires us to say something about the objects of preference and to watch what the agent does from the outside. And coherence arguments already commit us to saying something about the objects of preference. If we say nothing, we get no predictions out of them.
> The pattern of how an agent chooses options are that agent’s preferences, whether we think of them as such or whether they’re conceived as a decision rule to prevent being dominated by expected-utility maximizers!
You can define ‘preferences’ so that this is true, but then it need not follow that agents will pay costs to shift probability mass away from dispreferred options and towards preferred options. And that’s the thing that matters when we’re trying to create a shutdownable agent. We want to ensure that agents won’t pay costs to influence shutdown-time.
Also, take your decision-tree and replace ‘B’ with ‘A-’. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn’t sound right, and so it speaks against the definition.
> I think it’s interesting to note that we’re also doing something like throwing out the axiom of independence from unused alternatives
Not true. The axiom we’re giving up is Decision-Tree Separability. That’s different to VNM Independence, and different to Option-Set Independence. It might be hard to train agents that enduringly violate VNM Independence and/or Option-Set Independence. It doesn’t seem so hard to train agents that enduringly violate Decision-Tree Separability.
> In other words, if you wake up as this kind of agent on Monday, the way you cash-out your partial ordering over outcomes depends on your memory/model of what happened on Sunday.
Yes, nice point. Kinda weird? Maybe. Difficult to create artificial agents that do it? Doesn’t seem so.
> But notice that this refactor effectively turns Thornley’s agent into an agent with a set of preferences which satisfies the completeness and independence axioms of VNM
Yep, you can always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That’s not a problem for my proposal.
> the point is that “incomplete preferences” combined with a decision making algorithm which prevents the agent’s policy from being strictly dominated by an expected utility maximizer ends up, in practice, as isomorphic to an expected utility maximizer which is optimizing over histories/trajectories.
Not true. As I say elsewhere:
> And an agent abiding by the Caprice Rule can’t be represented as maximising utility, because its preferences are incomplete. In cases where the available trades aren’t arranged in some way that constitutes a money-pump, the agent can prefer (/reliably choose) A+ over A, and yet lack any preference between (/stochastically choose between) A+ and B, and lack any preference between (/stochastically choose between) A and B. Those patterns of preference/behaviour are allowed by the Caprice Rule.
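A minimal sketch of why this pattern resists a utility representation over the options A+, A and B themselves. Two modelling assumptions are made here that are not in the original: a utility maximizer chooses reliably only where utilities differ, and chooses stochastically only where they are equal. The helper `permissible` is a hypothetical rendering of the policy quoted earlier (‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X’), not a full statement of the Caprice Rule.

```python
from itertools import product

OPTIONS = ["A+", "A", "B"]
STRICT = {("A+", "A")}  # the only strict preference: A+ over A


def represents(u):
    """True if utilities u reproduce the pattern: higher utility exactly
    where there is a strict preference, equal utility where there is none."""
    for x, y in product(OPTIONS, repeat=2):
        if x == y:
            continue
        if (x, y) in STRICT:
            if not u[x] > u[y]:
                return False
        elif (y, x) not in STRICT:
            # No preference either way: assumed to require equal utility.
            if u[x] != u[y]:
                return False
    return True


# Only the ordering of utilities matters, so three levels cover every case.
candidates = [dict(zip(OPTIONS, vals)) for vals in product(range(3), repeat=3)]
representing = [u for u in candidates if represents(u)]


def permissible(available, turned_down):
    """Options that are not strictly dispreferred to anything turned down earlier."""
    return [x for x in available
            if not any((old, x) in STRICT for old in turned_down)]


after_declining_a_plus = permissible(["A", "B"], turned_down=["A+"])
with_no_history = permissible(["A", "B"], turned_down=[])
```

Under these assumptions no candidate survives: equal utility for A+ and B, and for A and B, forces equal utility for A+ and A, which contradicts the strict preference. The history-dependent filter shows how the same option set can yield different permissible choices depending on what was turned down before.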
> I want to consider how there’s a common misunderstanding of “outcomes” in the VNM-sense as needing to be about physical facts of the future universe (such as number of paperclips) rather than as potentially including historical facts, such as which options were abandoned or whether the agent took the default action. This is extremely relevant for corrigibility since one of the key ideas in my strategy is to shift the AI’s preferences from being about things like whether the button is pushed to being about whether the agent consistently maintained a certain kind of relationship with the principal during the relevant period.
Same point here as above. You can get any agent to satisfy the VNM axioms by enriching the objects of preference. A concern is that these more complex preferences are harder to reliably train into your agent.
- Max Harms 3 Jul 2024 16:15 UTC
  LW: 2 AF: 2
  Excellent response. Thank you. :) I’ll start with some basic responses, and will respond later to other points when I have more time.
  > I think you intend ‘sensitive to unused alternatives’ to refer to the Independence axiom of the VNM theorem, but VNM Independence isn’t about unused alternatives. It’s about lotteries that share a sublottery. It’s Option-Set Independence (sometimes called ‘Independence of Irrelevant Alternatives’) that’s about unused alternatives.
  I was speaking casually here, and I now regret it. You are absolutely correct that Option-Set Independence is not the Independence axiom. My best guess about what I meant was that VNM assumes that the agent has preferences over lotteries in isolation, rather than, for example, a way of picking preferences out of a set of lotteries. For instance, a VNM agent must have a fixed opinion about lottery A compared to lottery B, regardless of whether that agent has access to lottery C.
  > agents with intransitive preferences can be straightforwardly money-pumped
  > Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.
  You are correct. My “straightforward” mechanism for money-pumping an agent with preferences A > B, B > C, but which does not prefer A to C does indeed depend on being able to force the agent to pick either A or C in a way that doesn’t reliably pick A.
- Max Harms 3 Jul 2024 16:20 UTC
  LW: 1 AF: 1
  > Also, take your decision-tree and replace ‘B’ with ‘A-’. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn’t sound right, and so it speaks against the definition.
  Can you be more specific here? I gave several trees, above, and am not easily able to reconstruct your point.
  - EJT 6 Jul 2024 12:02 UTC
    2 points
    Ah yep I’m talking about the first decision-tree in the ‘Incomplete preferences’ section.
    - Max Harms 19 Jul 2024 20:00 UTC
      1 point
      Thanks. (And apologies for the long delay in responding.)
      Here’s my attempt at not talking past each other:
      We can observe the actions of an agent from the outside, but as long as we’re merely doing so, without making some basic philosophical assumptions about what it cares about, we can’t generalize these observations. Consider the first decision-tree presented above that you reference. We might observe the agent swap A for B and then swap A+ for B. What can we conclude from this? Naively we could guess that A+ > B > A. But we could also conclude that A+ > {B, A} and that because the agent can see the A+ down the road, they swap from A to B purely for the downstream consequence of getting to choose A+ later. If B = A-, we can still imagine the agent swapping in order to later get A+, so the initial swap doesn’t tell us anything. But from the outside we also can’t really say that A+ is always preferred over A. Perhaps this agent just likes swapping! Or maybe there’s a different governing principle that’s being neglected, such as preferring almost (but not quite) getting B.
      The point is that we want to form theories of agents that let us predict their behavior, such as when they’ll pay a cost to avoid shutdown. If we define the agent’s preferences as “which choices the agent makes in a given situation” we make no progress towards a theory of that kind. Yes, we can construct a frame that treats Incomplete Preferences as EUM of a particular kind, but so what? The important bit is that an Incomplete Preference agent can be set up so that it provably isn’t willing to pay costs to avoid shutdown.
      Does that match your view?
      - EJT 19 Nov 2024 11:54 UTC
        1 point
        Yes, that’s a good summary. The one thing I’d say is that you can characterize preferences in terms of choices and get useful predictions about what the agent will do in other circumstances if you say something about the objects of preference. See my reply to Lucius above.