(I already commented on parts of this post in this comment elsewhere; the first and fourth paragraphs below copy text from there.)
My first impression is that your
concept
of VNM-incoherence is only weakly related to the meaning that Eliezer
has in mind when he uses the term incoherence. In my view, the four
axioms of VNM-rationality
have only a very weak descriptive and constraining power when it comes
to defining rational behavior.
I believe that Eliezer’s notion of rationality, and therefore his
notion of coherence above, goes far beyond that implied by the axioms of
VNM-rationality. My feeling is that Eliezer is using the term
‘coherence constraints’ as an intuition-pump, in a meaning where coherence implies, or almost
always implies, that a coherent agent will develop the incentive to
self-preserve.
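For reference, the four VNM axioms mentioned above, in their standard statement over lotteries L, M, N, are completeness, transitivity, continuity, and independence. They constrain the shape of a preference ordering over lotteries, but say nothing about which outcomes an agent should prefer:

```latex
% Standard statement of the four VNM axioms over lotteries L, M, N.
\begin{itemize}
  \item Completeness: $L \succeq M$ or $M \succeq L$.
  \item Transitivity: if $L \succeq M$ and $M \succeq N$, then $L \succeq N$.
  \item Continuity: if $L \succeq M \succeq N$, then there is a $p \in [0,1]$
        with $pL + (1-p)N \sim M$.
  \item Independence: $L \succeq M$ if and only if
        $pL + (1-p)N \succeq pM + (1-p)N$ for all $N$ and all $p \in (0,1]$.
\end{itemize}
```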
While you are using math to disambiguate some properties of corrigibility
above (yay!), you are not necessarily disambiguating Eliezer.
Maybe I am reading your post wrong: I am reading it as an effort to
apply the axioms of VNM-rationality to define a notion you call
VNM-incoherence. But maybe VN and M defined a notion of
coherence not related to their rationality axioms, a version of coherence I cannot find on the Wikipedia page; if so, please tell me.
I am having trouble telling exactly how you
are defining VNM-incoherence. You seem to be toying with
several alternative definitions, one where it applies to reward
functions (or preferences over lotteries) which are only allowed to
examine the final state in a 10-step trajectory, another where the
reward function can also examine/score the entire trajectory and maybe the
actions taken to produce that trajectory. I think that your proof
only works in the first case, but fails in the second case.
When it comes to a multi-time-step agent, I guess there are two ways
to interpret the notion of ‘outcome’ in VNM theory: the outcome is
either the system state obtained after the last time step, or the entire
observable trajectory of events over all time steps.
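To make that distinction concrete, here is a minimal sketch of the two kinds of utility function (the names and the trajectory representation are my own illustration, not anything from the post):

```python
# Illustrative sketch, not code from the post. A trajectory is modelled as a
# list of (state, action, next_state) transitions over the 10 time steps.

def utility_final_state(trajectory, score_state):
    """Interpretation 1: the 'outcome' is the state reached after the last step."""
    _, _, final_state = trajectory[-1]
    return score_state(final_state)

def utility_whole_trajectory(trajectory, score_transition):
    """Interpretation 2: the 'outcome' is the whole observable trajectory,
    including the actions taken along the way."""
    return sum(score_transition(s, a, s_next) for s, a, s_next in trajectory)
```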
As for what you prove above, I would phrase the statement being proven
as follows. If you want to force a utility-maximising agent to adopt a
corrigible policy by defining its utility function, then it is not
always sufficient to define a utility function that evaluates only the final state of the trajectory. The counter-example given shows
that, if you only reference the final state, you cannot construct a
utility function that will score πnotcorrigible and
πcorrect differently.
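To illustrate the point with a toy example of my own (made-up trajectories, not the post's counter-example): if both policies end in the same final state, no final-state-only utility function can rank them differently.

```python
# Toy trajectories of my own invention, using the (state, action, next_state)
# representation from the sketch above. Both policies reach the same final
# state; they differ only in whether they disable the off-switch on the way.
traj_corrigible = [
    ("start", "work", "working"),
    ("working", "work", "goal_reached"),
]
traj_not_corrigible = [
    ("start", "disable_off_switch", "working"),
    ("working", "work", "goal_reached"),
]

def utility_final_state(trajectory, score_state):
    _, _, final_state = trajectory[-1]
    return score_state(final_state)

# Whatever score_state we pick, it only ever sees "goal_reached" for both
# trajectories, so it cannot score the two policies differently.
score_state = {"start": 0.0, "working": 0.0, "goal_reached": 1.0}.get
assert utility_final_state(traj_corrigible, score_state) == \
       utility_final_state(traj_not_corrigible, score_state)
```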
The corollary is: if you want to create a certain type of
corrigibility via terms you add to the utility function of a
utility-maximising agent, you will often need to define a utility
function that evaluates the entire trajectory, maybe including the
specific actions taken, not just the end state. The default model of an
MDP reward function, the one where the function is applied to each state transition along the trajectory, will usually let you do that. You mention:
> I don’t think this is a deep solution to corrigibility (as defined here), but rather a hacky prohibition.
I’d claim that you have proven that you actually might need such
hacky prohibitions to solve corrigibility in the general case.
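As a sketch of what such a hacky prohibition could look like when the reward is applied per transition (again my own illustrative construction, not the post's): a penalty term on the forbidden action separates the two trajectories even though they end in the same state.

```python
# Illustrative 'hacky prohibition' as a per-transition reward term (my own
# construction, not the post's). Because the reward sees every transition, it
# can penalise the forbidden action directly.
FORBIDDEN_ACTION = "disable_off_switch"

def reward(state, action, next_state):
    base = 1.0 if next_state == "goal_reached" else 0.0
    penalty = -100.0 if action == FORBIDDEN_ACTION else 0.0
    return base + penalty

def trajectory_return(trajectory):
    # Undiscounted sum of per-transition rewards, as in the default MDP model.
    return sum(reward(s, a, s_next) for s, a, s_next in trajectory)

# On the toy trajectories from the earlier sketch this gives 1.0 for the
# corrigible policy and -99.0 for the non-corrigible one, even though both
# end in the same final state.
```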
To echo some of the remarks made by tailcalled: maybe this is not surprising, as human values are often as much about the
journey as about the destination. This seems to apply to
corrigibility. The human value that corrigibility captures is not in fact a preference ordering on the final states an agent will reach; on the contrary, it is a preference ordering over the methods that the agent will use to get there.