Imagine taking someone’s utility function, and inverting it by flipping the sign on all evaluations. What might this actually look like? Well, if previously I wanted a universe filled with happiness, now I’d want a universe filled with suffering; if previously I wanted humanity to flourish, now I want it to decline.
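In symbols, the flip is just: the flipped agent evaluates every outcome $x$ as $U_{\text{flipped}}(x) = -U(x)$.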
But this is assuming a Cartesian utility function. Once we treat ourselves as embedded agents, things get trickier. For example, suppose that I used to want people with similar values to me to thrive, and people with different values from me to suffer. Now if my utility function is flipped, that naively means that I want people with similar values to suffer, and people with different values to thrive. But this has a very different outcome if we interpret “similar to me” as de dicto vs de re—i.e. whether it refers to the old me or the new me.
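To make the ambiguity concrete, here's a toy sketch (my own; the similarity measure, value vectors, and numbers are all invented for illustration):

```python
# Toy sketch: similarity measure, value vectors, and numbers are all made up.

def similarity(values_a, values_b):
    """Toy similarity: +1 if the value vectors point the same way, -1 if opposed."""
    dot = sum(a * b for a, b in zip(values_a, values_b))
    return 1 if dot > 0 else -1

def utility(my_values, their_values, their_wellbeing):
    """I like it when people with values similar to mine do well, and dislike it otherwise."""
    return similarity(my_values, their_values) * their_wellbeing

old_values = [1.0]       # e.g. "happiness is good"
flipped_values = [-1.0]  # the sign-flipped values
ally = [1.0]             # someone who shares my old values
wellbeing = 10.0

# De re: "similar to me" stays bound to the old me; only the sign flips.
print(-utility(old_values, ally, wellbeing))      # -10.0: I now want my old allies to suffer

# De dicto: "similar to me" re-binds to the flipped me, and the sign flips too.
print(-utility(flipped_values, ally, wellbeing))  # 10.0: the two flips cancel
```

Under the de re reading I turn on my old allies; under the de dicto reading the two flips cancel and I treat them exactly as before.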
This is a more general problem when one person’s utility function can depend on another person’s, where you can construct circular dependencies (which I assume you can also do in the utility-flipping case). There’s probably been a bunch of work on this; I’d be interested in pointers to it (e.g. I assume there have been attempts to construct type systems for utility functions, or something like that).
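For a minimal picture of the circularity, here's a toy sketch (my own; the linear "caring" terms and weights are invented, and much simpler than any real proposal): two agents whose utilities each include the other's, resolved by solving for a fixed point rather than recursing forever.

```python
# Toy sketch: linear "caring" terms with made-up weights.

def solve_mutual_utilities(base_a, base_b, care_a_for_b, care_b_for_a):
    """
    Suppose U_A(x) = base_A(x) + c_AB * U_B(x)
        and U_B(x) = base_B(x) + c_BA * U_A(x).
    Naive recursion never bottoms out; instead we solve the linear fixed point,
    which exists as long as |c_AB * c_BA| < 1.
    """
    denom = 1 - care_a_for_b * care_b_for_a
    u_a = (base_a + care_a_for_b * base_b) / denom
    u_b = (base_b + care_b_for_a * base_a) / denom
    return u_a, u_b

# A mildly likes the outcome, B mildly dislikes it, and each puts weight 0.5 on the other.
print(solve_mutual_utilities(1.0, -1.0, 0.5, 0.5))    # (0.666..., -0.666...)

# Flipping A means negating base_A *and* A's weight on B; note that B's utility
# changes too, because the flip propagates around the loop.
print(solve_mutual_utilities(-1.0, -1.0, -0.5, 0.5))  # (-0.4, -1.2)
```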
(This note inspired by Mad Investor Chaos, where (SPOILERS) one god declines to take revenge, because they’re the utility-flipped version of another god who would have taken revenge. At first this made sense, but now I feel like it’s not type-safe.)
Actually, this raises a more general point (can’t remember if I’ve made this before): we’ve evolved some values (like caring about revenge) because they’re game-theoretically useful. But if game theory says to take revenge, and also our values say to take revenge, then this is double-counting. So I’d guess that, for much more coherent agents, their level of vengefulness would mainly be determined by their decision theories (which can’t be flipped) rather than their utilities.
Fundamentally, humans aren’t VNM-rational, and don’t actually have utility functions. Which makes the thought experiment much less fun. If you recast it as “what if a human brain’s reinforcement mechanisms were reversed”, I suspect it’s also boring: simple early death.
The interesting fictional cases are when some subset of a person’s legible motivations are reversed, but the mass of other drives remain. This very loosely maps to reversing terminal goals and re-calculating instrumental goals—they may reverse, stay, or change in weird ways.
The indirection case is solved (or rather unasked) by inserting a “perceived” in the calculation chain. Your goals don’t depend on similarity to you, they depend on your perception (or projection) of similarity to you.
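A toy way to cash this out (the explicit self-model object is just one possible representation of "perceived"):

```python
# Toy sketch: an explicit (possibly stale) self-model stands in for "perceived".

def utility(self_model_values, their_values, their_wellbeing):
    """Goals depend on my *model* of my values, not directly on my values."""
    dot = sum(a * b for a, b in zip(self_model_values, their_values))
    perceived_similarity = 1 if dot > 0 else -1
    return perceived_similarity * their_wellbeing

self_model = [1.0]  # my picture of what I value; a separate object from my actual values
ally = [1.0]
wellbeing = 10.0

# Flipping the utility function just negates the output; there is no hidden
# reference to "me" left inside to re-bind, so the flip is unambiguous.
print(-utility(self_model, ally, wellbeing))  # -10.0
```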
I have been asking a similar question for a long time. It is similar to the standard puzzle: if we negate regularity, do we get regular irregularity or irregular irregularity? That is, at what level are we negating the phenomenon, and is it only at one level?