(I already commented on parts of this post in this comment elsewhere; the first and fourth paragraphs below copy text from there.)
My first impression is that your
concept
of VNM-incoherence is only weakly related to the meaning that Eliezer
has in mind when he uses the term incoherence. In my view, the four
axioms of VNM-rationality
have only a very weak descriptive and constraining power when it comes
to defining rational behavior.
I believe that Eliezer’s notion of rationality, and therefore his
notion of coherence above, goes far beyond that implied by the axioms of
VNM-rationality. My feeling is that Eliezer is using the term
‘coherence constraints’ as an intuition-pump, in a meaning where coherence implies, or almost
always implies, that a coherent agent will develop the incentive to
self-preserve.
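For reference, the four VNM axioms mentioned above, in their standard statement over lotteries L, M, N, are completeness, transitivity, continuity, and independence. They constrain the shape of a preference ordering over lotteries, but say nothing about which outcomes an agent should prefer:

```latex
% Standard statement of the four VNM axioms over lotteries L, M, N.
\begin{itemize}
  \item Completeness: $L \succeq M$ or $M \succeq L$.
  \item Transitivity: if $L \succeq M$ and $M \succeq N$, then $L \succeq N$.
  \item Continuity: if $L \succeq M \succeq N$, then there is a $p \in [0,1]$
        with $pL + (1-p)N \sim M$.
  \item Independence: $L \succeq M$ if and only if
        $pL + (1-p)N \succeq pM + (1-p)N$ for all $N$ and all $p \in (0,1]$.
\end{itemize}
```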
While you are using math to disambiguate some properties of corrigibility
above (yay!), you are not necessarily disambiguating Eliezer.
Maybe I am reading your post wrong: I am reading it as an effort to
apply the axioms of VNM-rationality to define a notion you call
VNM-incoherence. But maybe VN and M defined a notion of
coherence not related to their rationality axioms, a version of coherence I cannot find on the Wikipedia page; if so, please tell me.
I am having trouble telling exactly how you
are defining VNM-incoherence. You seem to be toying with
several alternative definitions, one where it applies to reward
functions (or preferences over lotteries) which are only allowed to
examine the final state in a 10-step trajectory, another where the
reward function can also examine/score the entire trajectory and maybe the
actions taken to produce that trajectory. I think that your proof
only works in the first case, but fails in the second case.
When it comes to a multi-time-step agent, I guess there are two ways
to interpret the notion of ‘outcome’ in VNM theory: the outcome is
either the system state obtained after the last time step, or the entire
observable trajectory of events over all time steps.
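To make that distinction concrete, here is a minimal sketch of the two kinds of utility function (the names and the trajectory representation are my own illustration, not anything from the post):

```python
# Illustrative sketch, not code from the post. A trajectory is modelled as a
# list of (state, action, next_state) transitions over the 10 time steps.

def utility_final_state(trajectory, score_state):
    """Interpretation 1: the 'outcome' is the state reached after the last step."""
    _, _, final_state = trajectory[-1]
    return score_state(final_state)

def utility_whole_trajectory(trajectory, score_transition):
    """Interpretation 2: the 'outcome' is the whole observable trajectory,
    including the actions taken along the way."""
    return sum(score_transition(s, a, s_next) for s, a, s_next in trajectory)
```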
As for what you prove above, I would phrase the statement being proven
as follows. If you want to force a utility-maximising agent to adopt a
corrigible policy by defining its utility function, then it is not
always sufficient to define a utility function that evaluates only the final state of the trajectory. The counter-example given shows
that, if you only reference the final state, you cannot construct a
utility function that will score πnotcorrigible and
πcorrect differently.
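To illustrate the point with a toy example of my own (made-up trajectories, not the post's counter-example): if both policies end in the same final state, no final-state-only utility function can rank them differently.

```python
# Toy trajectories of my own invention, using the (state, action, next_state)
# representation from the sketch above. Both policies reach the same final
# state; they differ only in whether they disable the off-switch on the way.
traj_corrigible = [
    ("start", "work", "working"),
    ("working", "work", "goal_reached"),
]
traj_not_corrigible = [
    ("start", "disable_off_switch", "working"),
    ("working", "work", "goal_reached"),
]

def utility_final_state(trajectory, score_state):
    _, _, final_state = trajectory[-1]
    return score_state(final_state)

# Whatever score_state we pick, it only ever sees "goal_reached" for both
# trajectories, so it cannot score the two policies differently.
score_state = {"start": 0.0, "working": 0.0, "goal_reached": 1.0}.get
assert utility_final_state(traj_corrigible, score_state) == \
       utility_final_state(traj_not_corrigible, score_state)
```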
The corollary is: if you want to create a certain type of
corrigibility via terms you add to the utility function of a
utility-maximising agent, you will often need to define a utility
function that evaluates the entire trajectory, maybe including the
specific actions taken, not just the end state. The default model of an
MDP reward function, the one where the function is applied to each state transition along the trajectory, will usually let you do that. You mention:
> I don’t think this is a deep solution to corrigibility (as defined here), but rather a hacky prohibition.
I’d claim that you have proven that you actually might need such
hacky prohibitions to solve corrigibility in the general case.
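As a sketch of what such a hacky prohibition could look like when the reward is applied per transition (again my own illustrative construction, not the post's): a penalty term on the forbidden action separates the two trajectories even though they end in the same state.

```python
# Illustrative 'hacky prohibition' as a per-transition reward term (my own
# construction, not the post's). Because the reward sees every transition, it
# can penalise the forbidden action directly.
FORBIDDEN_ACTION = "disable_off_switch"

def reward(state, action, next_state):
    base = 1.0 if next_state == "goal_reached" else 0.0
    penalty = -100.0 if action == FORBIDDEN_ACTION else 0.0
    return base + penalty

def trajectory_return(trajectory):
    # Undiscounted sum of per-transition rewards, as in the default MDP model.
    return sum(reward(s, a, s_next) for s, a, s_next in trajectory)

# On the toy trajectories from the earlier sketch this gives 1.0 for the
# corrigible policy and -99.0 for the non-corrigible one, even though both
# end in the same final state.
```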
To echo some of the remarks made by tailcalled: maybe this is not surprising, as human values are often as much about the
journey as about the destination. This seems to apply to
corrigibility. The human value that corrigibility captures is not in fact a preference ordering on the final states an agent will reach; on the contrary, it is a preference ordering over the methods that the agent will use to get there.