Reasoning a little less poetically:
The space of potential preferences is very high-dimensional, and what I care about occupies only a tiny subset of it (though one still complex in its own right).
Taking actions or accepting offers that would improve the world according to my preferences has both absolute and opportunity costs, meaning I only act when the expected positive impact clears some threshold.
Preferences outside my core domain of action are:
generally very weak at best, with total indifference being the norm, and
chaotically noisy, such that they may vary with all kinds of situational characteristics unpredictable to me or to an outside observer.
Also, my understanding of how my actions affect even the areas I truly care about is sufficiently imperfect that, outside a well-understood range, the expected value of acting is low, especially given conservative preferences.
These factors greatly reduce the set of potential Dutch books I would ever actually accept, and so reduce anyone's ability to exploit whatever inconsistencies I have (see the toy sketch after this list).
Also, repeated exposure to simple Dutch books and to failures to maximize is likely to be corrected by further learning.
We should therefore expect to find agents that are utility-maximizing only as a contingent outcome of their learning trajectory: they display utility-maximizing behavior in areas that are both reward-relevant and within the training domain.
While self-modification to smooth one's preferences into a simple, consistent landscape is a plausible action, it shouldn't be seen as dominant, since less drastic actions are likely to be sufficient to avoid being Dutch booked.
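To make the Dutch-book point above a little more concrete, here is a minimal toy sketch; it is not from the original argument, and the names, numbers, and cyclic-preference structure are all illustrative assumptions. The agent has mildly cyclic (hence exploitable-in-principle) preferences in its core domain, weak noisy preferences elsewhere, and only acts when the perceived gain clears a threshold. A money pump that charges a small fee per swap then stalls immediately.

```python
# Toy sketch (illustrative assumptions only): an agent with mildly cyclic core-domain
# preferences, noisy near-indifference off-domain, and an action threshold, facing a
# would-be money pump that charges a small fee per swap.

import random

random.seed(0)

ACTION_THRESHOLD = 0.05   # minimum perceived gain worth acting on (absolute + opportunity cost)
OFF_DOMAIN_NOISE = 0.02   # weak, chaotic preferences outside the core domain
FEE = 0.01                # what the would-be exploiter charges per swap

# Cyclic pairwise preferences in the core domain: B over A, C over B, A over C,
# each by a small margin. Intransitive, hence money-pumpable in principle.
CYCLIC_GAIN = {("A", "B"): 0.02, ("B", "C"): 0.02, ("C", "A"): 0.02}

def perceived_gain(have: str, get: str) -> float:
    """Perceived gain from swapping `have` for `get`."""
    if (have, get) in CYCLIC_GAIN:
        return CYCLIC_GAIN[(have, get)]
    if (get, have) in CYCLIC_GAIN:
        return -CYCLIC_GAIN[(get, have)]
    return random.gauss(0.0, OFF_DOMAIN_NOISE)  # off-domain: noisy near-indifference

def accepts(have: str, get: str, threshold: float) -> bool:
    """Accept a swap only if the gain, net of the fee, clears the action threshold."""
    return perceived_gain(have, get) - FEE > threshold

def pump(cycle: list[str], threshold: float, max_rounds: int = 1000) -> float:
    """Walk the agent around the cycle, charging FEE per accepted swap; return total extracted."""
    holding, extracted = cycle[0], 0.0
    for _ in range(max_rounds):
        nxt = cycle[(cycle.index(holding) + 1) % len(cycle)]
        if not accepts(holding, nxt, threshold):
            return extracted  # the pump stalls at the first below-threshold swap
        holding, extracted = nxt, extracted + FEE
    return extracted

if __name__ == "__main__":
    print("With an action threshold:", pump(["A", "B", "C"], ACTION_THRESHOLD))  # extracts 0.0
    print("With no threshold at all:", pump(["A", "B", "C"], 0.0))               # pumped every round
```

Dropping the threshold to zero in the same sketch lets the pump run indefinitely, which is the contrast the argument relies on: under these assumptions it is the cost structure of acting, not perfectly consistent preferences, that blocks exploitation.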
A major crux for this view as applied to systems like humans is the explanation of how our (relatively) simple, compressible goals and ideas emerge out of the outrageous complexity of our minds. My feeling is that, once learning has started, adjustments to the mind pick up broad contours as ways to act, but only as a reflection of the world and reward system in which they are placed. If instead there is some kind of core underlying drive towards logical simplicity, or if a logically simple set of drives, once in place, is somehow dominant or tends to spread through a network, then I would expect smarter agents to quickly become more agent-like.