These topics (although not this exact question in its present form, AFAICT) have been discussed on LW over the past few years. A few highlights that seem relevant here:
Rohin Shah has explained that if the domain of the utility function is the set of universe-histories, rather than the set of world-states evaluated at the present time, then a consequentialist agent can display any external behavior whatsoever; a minimal sketch of this construction is included after this list. (Note that @johnswentworth has criticized the conclusions that are often drawn as supposed corollaries of this, but he did so in a comment on a now-deleted post that I can no longer link to; the closest available link I have is this.)
Steve Byrnes has distinguished preferences over future states from preferences over trajectories, and has argued that future powerful AIs will have preferences over both distant-future world-states and other things, such as trajectories.
EJT has explained that “agents can make themselves immune to all possible money-pumps for completeness by acting in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ Acting in accordance with this policy need never require an agent to act against any of their preferences.” A toy implementation of this policy is also sketched after the list.
My own question post from a couple of months ago, on the implications of coherence arguments for agentic behavior, got a ton of engagement and contains other relevant links.
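To make Rohin's point concrete, here is a minimal Python sketch (my own illustration, not his exact construction, and the names `make_history_utility` and `copycat_policy` are hypothetical): for any fixed policy mapping observation-histories to actions, one can define a utility function over complete trajectories that is maximized exactly by following that policy, so external behavior alone never rules out "utility maximization over histories".

```python
# Minimal sketch: any fixed policy pi over observation-histories is rationalized
# by the trajectory-level utility U(traj) = 1 if every action matches pi, else 0.
from typing import Callable, List, Tuple

History = Tuple[str, ...]                # past observations
Trajectory = List[Tuple[History, str]]   # (history so far, action taken) pairs

def make_history_utility(pi: Callable[[History], str]) -> Callable[[Trajectory], int]:
    """Build a utility function over trajectories that the policy pi maximizes."""
    def utility(traj: Trajectory) -> int:
        return int(all(action == pi(history) for history, action in traj))
    return utility

# Example: a deliberately "incoherent-looking" policy that copies its latest observation.
def copycat_policy(history: History) -> str:
    return history[-1] if history else "noop"

U = make_history_utility(copycat_policy)

traj_followed = [((), "noop"), (("a",), "a"), (("a", "b"), "b")]
traj_deviated = [((), "noop"), (("a",), "b"), (("a", "b"), "b")]

assert U(traj_followed) == 1   # following the policy attains maximal utility
assert U(traj_deviated) == 0   # any deviation scores strictly worse
```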
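And here is a toy Python sketch of EJT's quoted policy, under a simplified model of my own: "turning down an option" is modeled as trading away the currently held option, `strictly_prefers` encodes an incomplete preference relation, and the option names ("A", "B", "A_minus") are illustrative. The agent refuses anything strictly dispreferred to an option it previously gave up, so the standard money pump for completeness fails at its final step.

```python
# Strict preferences; any pair not listed is incomparable (preferences are incomplete).
strictly_prefers = {("A", "A_minus")}   # A is strictly better than A_minus

def strictly_dispreferred(x, y):
    """Is x strictly worse than y?"""
    return (y, x) in strictly_prefers

class CautiousAgent:
    def __init__(self, holding):
        self.holding = holding
        self.turned_down = set()

    def offer(self, new_option):
        """Accept a trade of self.holding for new_option unless the policy forbids it."""
        # EJT's quoted policy: never choose an option strictly dispreferred
        # to anything previously turned down.
        if any(strictly_dispreferred(new_option, past) for past in self.turned_down):
            return False
        # Basic rationality condition (my addition, not part of the quote):
        # never trade for something strictly worse than what you currently hold.
        if strictly_dispreferred(new_option, self.holding):
            self.turned_down.add(new_option)
            return False
        self.turned_down.add(self.holding)  # we are giving this option up
        self.holding = new_option
        return True

agent = CautiousAgent("A")
assert agent.offer("B") is True        # A and B are incomparable; the trade is permitted
assert agent.offer("A_minus") is False # refused: A_minus is strictly worse than the A it gave up
assert agent.holding == "B"            # the agent never ends up strictly worse off than it started
```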
Do you remember the title of the post? I can probably link to a comment-only version of it.
“Every system is equivalent to some utility maximizer. So, why are we still alive?”
(I think this links to Richard Kennaway’s comment on the post rather than Wentworth’s, but it was the link I could find most easily.)