Ah yep I’m talking about the first decision-tree in the ‘Incomplete preferences’ section.
Thanks. (And apologies for the long delay in responding.)
Here’s my attempt at not talking past each other:
We can observe the actions of an agent from the outside, but as long as we’re merely doing so, without making some basic philosophical assumptions about what it cares about, we can’t generalize these observations. Consider the first decision-tree presented above that you reference. We might observe the agent swap A for B and then swap B for A+. What can we conclude from this? Naively we could guess that A+ > B > A. But we could also conclude that A+ > {B, A} and that, because the agent can see the A+ down the road, they swap from A to B purely for the downstream consequence of getting to choose A+ later. If B = A-, we can still imagine the agent swapping in order to later get A+, so the initial swap doesn’t tell us anything. But from the outside we also can’t really say that A+ is always preferred over A. Perhaps this agent just likes swapping! Or maybe there’s a different governing principle that’s being neglected, such as a preference for almost (but not quite) getting B.
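To make the underdetermination concrete, here’s a minimal sketch (purely illustrative, not anything from the post) that enumerates which total rankings of {A, B, A+} are consistent with the observed swaps under two different assumptions about how the agent chooses:

```python
from itertools import permutations

# Illustrative sketch: which total strict rankings of {A, B, A+} fit the
# observed swaps (A -> B, then B -> A+), under two different assumptions
# about what a swap reveals?

OUTCOMES = ["A", "B", "A+"]

def prefers(ranking, x, y):
    """True if x is strictly ranked above y in this candidate ranking."""
    return ranking.index(x) < ranking.index(y)

def consistent_myopic(ranking):
    # Assumption 1: each swap reveals a pairwise preference.
    # Swap A -> B reveals B > A; swap B -> A+ reveals A+ > B.
    return prefers(ranking, "B", "A") and prefers(ranking, "A+", "B")

def consistent_forward_looking(ranking):
    # Assumption 2: the agent only cares about the terminal outcome it can
    # reach, so the first swap just needs A+ (reachable via B) to beat A;
    # it reveals nothing about B vs A.
    return prefers(ranking, "A+", "A") and prefers(ranking, "A+", "B")

print("Consistent rankings, if each swap reveals a pairwise preference:")
for r in permutations(OUTCOMES):
    if consistent_myopic(list(r)):
        print("  ", " > ".join(r))

print("Consistent rankings, if the agent plans ahead to reach A+:")
for r in permutations(OUTCOMES):
    if consistent_forward_looking(list(r)):
        print("  ", " > ".join(r))
```

Under the second assumption, both A+ > B > A and A+ > A > B survive, so the observations don’t settle how the agent ranks A against B. And neither assumption even covers hypotheses like “the agent just likes swapping”, which aren’t rankings over outcomes at all.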
The point is that we want to form theories of agents that let us predict their behavior, such as when they’ll pay a cost to avoid shutdown. If we define the agent’s preferences as “which choices the agent makes in a given situation”, we make no progress towards a theory of that kind. Yes, we can construct a frame that treats Incomplete Preferences as EUM of a particular kind, but so what? The important bit is that an agent with Incomplete Preferences can be set up so that it provably isn’t willing to pay costs to avoid shutdown.
Does that match your view?
Yes, that’s a good summary. The one thing I’d say is that you can characterize preferences in terms of choices and still get useful predictions about what the agent will do in other circumstances, provided you say something about the objects of preference. See my reply to Lucius above.