I’m interested in hearing about how your approach handles this environment, because I think I’m getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.
I read your post; here are my initial impressions on how it relates to the discussion here.
In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that; it looks like a good way to move forward. Developing the math further has definitely been my own approach to de-confusing certain intuitive notions about what should or should not be possible with corrigibility.
However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer's notion of rationality, and therefore his notion of coherence above, goes far beyond what is implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term 'coherence constraints' in an intuition-pump way, where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.
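For reference, and in my own notation (so treat this as a paraphrase rather than a quote of any particular source), the four axioms only constrain a preference relation $\succeq$ over lotteries $L$, $M$, $N$ as follows:

Completeness: $L \succeq M$ or $M \succeq L$.
Transitivity: if $L \succeq M$ and $M \succeq N$, then $L \succeq N$.
Continuity: if $L \succeq M \succeq N$, then there is some $p \in [0,1]$ with $pL + (1-p)N \sim M$.
Independence: for every lottery $N$ and every $p \in (0,1]$, $L \succeq M$ if and only if $pL + (1-p)N \succeq pM + (1-p)N$.

Nothing in these axioms mentions self-preservation, goal content, or even time; they only demand a certain internal consistency among preferences over lotteries, which is why I say their constraining power is weak.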
Looking at your post, I am also having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions: one where it applies to reward functions (or preferences over lotteries) that are only allowed to examine the final state of a 10-step trajectory, and another where the reward function can examine the entire trajectory, and maybe also the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second. This has certain (fairly trivial) corollaries about building corrigibility. I'll expand on this in a comment I plan to attach to your post.
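In symbols (my notation, not necessarily yours), the two readings differ in what the reward function is allowed to see:

$R_{\text{final}} : S \to \mathbb{R}$, applied as $R_{\text{final}}(s_{10})$ to the final state only,
$R_{\text{traj}} : (S \times A)^{10} \times S \to \mathbb{R}$, applied as $R_{\text{traj}}(s_0, a_0, s_1, \ldots, a_9, s_{10})$ to the whole trajectory.

My claim above is that your proof goes through for the $R_{\text{final}}$ reading but not for the $R_{\text{traj}}$ reading.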
I’m interested in hearing about how your approach handles this environment,
I think one way to connect your ABC toy environment to my approach is to look at sections 3 and 4 of my earlier paper, where I develop a somewhat similar clarifying toy environment, with running code.
Another comment I can make is that your ABC nodes-and-arrows state transition diagram is a depiction which makes it hard to see how to apply my approach, because the depiction mashes up the state of the world outside of the compute core and the state of the world inside the compute core. If you want to apply counterfactual planning, or if you want to have an agent design that can compute the balancing function terms according to Armstrong's indifference approach, you need a different depiction of your setup. You need one which separates out these two state components more explicitly. For example, make an MDP model where the individual states are instances of the tuple (physical position of the agent in the ABC playing field, policy function loaded into the compute core).
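To sketch what I mean, here is a minimal, hypothetical encoding of such a factored state space in Python; the position names, policy names and transition rules are all made up by me and are not taken from your post or from my paper:

```python
from itertools import product

# Hypothetical state components: positions in the ABC playing field, and
# the policy function that is currently loaded into the compute core.
positions = ["A", "B", "C"]
loaded_policies = ["original_policy", "successor_policy"]

# Each MDP state is a tuple:
#   (position of the agent in the world outside the compute core,
#    policy function inside the compute core)
states = list(product(positions, loaded_policies))

def transition(state, action):
    """Toy deterministic transition function.

    'go_A', 'go_B', 'go_C' change only the outside-world component;
    'rewrite_core' changes only the inside-the-core component.
    """
    position, policy = state
    if action in ("go_A", "go_B", "go_C"):
        return (action[-1], policy)
    if action == "rewrite_core":
        return (position, "successor_policy")
    return state

print(states)
print(transition(("A", "original_policy"), "rewrite_core"))
```

The only point of this factored representation is that it makes the outside-world component and the inside-the-compute-core component separately visible, so that a counterfactual planner or a balancing term can refer to one of them without the other.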
Not sure how to interpret your statement that you got lost in symbol-grounding issues. If you can expand on this, I might be able to help.
Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.
When it comes to Dutch booking as a coherence criterion, I need to repeat the observation I made below:
In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don’t see that fact mentioned often on this forum, so I will expand.
An agent that plans coherently given a reward function Rp to maximize paperclips will be an incoherent planner if you judge its actions by a reward function Rs that values the maximization of staples instead.
To extend this to Dutch booking: if you train a superintelligent poker-playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy that makes it lose money.
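To make the bookkeeping behind this observation concrete, here is a toy calculation in Python; the trajectory, the numbers and the two reward functions Rp and Rs are invented purely for illustration:

```python
# One fixed trajectory, produced by an agent that coherently maximizes
# the paperclip reward Rp.  We score the same trajectory under the
# staple reward Rs: judged by Rs, the agent systematically leaves value
# on the table, which is exactly the property a Dutch-book-style test
# (finding a counter-strategy that extracts money/utility) picks up on.
trajectory = [
    {"paperclips_made": 3, "staples_made": 0},
    {"paperclips_made": 5, "staples_made": 0},
]

def Rp(step):
    """Reward function that values paperclips."""
    return step["paperclips_made"]

def Rs(step):
    """Reward function that values staples instead."""
    return step["staples_made"]

return_under_Rp = sum(Rp(step) for step in trajectory)  # 8: looks optimal
return_under_Rs = sum(Rs(step) for step in trajectory)  # 0: looks wasteful

print(return_under_Rp, return_under_Rs)
```

Whether the agent comes out as 'coherent' or as easily Dutch-bookable in this toy calculation depends entirely on which of the two reward functions the judge uses, which is the point I am trying to make with the poker example above.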