Note: I get super redundant after like the first reply, so watch out for that. I’m not trying to be an asshole or anything; I’m just attempting to respond to your main point from every possible angle.
For the purposes of discussion on this site, a Friendly AI is assumed to be one that shares our terminal values.
What’s a “terminal value”?
My utility function assigns value to the desires of beings whose values conflict with my own.
Even for somebody trying to kill you for fun?
I can’t allow other values to supersede mine, but absent other considerations, I have to assign negative utility in my own function for creating negative utility in the functions of other existing beings.
What exactly would those “other considerations” be?
I have to assign negative utility in my own function for creating negative utility in the functions of other existing beings.
Would you be comfortable being a part of putting somebody in jail for murdering your best friend (whoever that is)?
I’m skeptical that an AI that would impose catastrophe on other thinking beings is really maximizing my utility.
What if somebody were to build an AI for hunting down and incarcerating murderers?
Would that “maximize your utility”, or would you be uncomfortable with the fact that it would be “imposing catastrophe” on beings “whose desires conflict with [your] own”?
It seems to me that to truly maximize my utility, an AI would need to have consideration for the utility of other beings.
What if the “terminal values” (assuming that I know what you mean by that) of those beings made killing you (for laughs!) a great way to “maximize their utility”?
Perhaps my utility function gives more value than most to beings that don’t share my values.
But does that extraordinary consideration stretch to the people bent on killing other people for fun?
However, if an AI imposes truly catastrophic fates on other intelligent beings, my own utility function takes such a hit that I cannot consider it friendly.
Would your utility function take that hit if an AI saved your best friend from one of those kinds of people (the ones who like to kill other people for laughs)?
Roughly, a terminal value is a thing you value for its own sake.
This is contrasted with instrumental values, which are things you value only because they provide a path to terminal values.
For example: money, on this view, is something we value only instrumentally… having large piles of money with no way to spend it isn’t actually what anyone wants.
Caveat: I should clarify that I am not sure terminal values actually exist, personally.
The linked comment seems to be questioning whether terminal values are stable or unambiguous, rather than whether they exist. Unless the ambiguity goes so deep as to make the values meaningless… but that seems far-fetched.
Hm… maybe. Certainly my understanding of the concept of terminal values includes stability, so I haven’t drawn that distinction much in my thinking.
That said, I don’t quite see how considering them distinct resolves any of my concerns. Can you expand on your thinking here?
Wikipedia has an article on it.
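To make the terminal/instrumental distinction concrete, here is a minimal toy sketch; everything in it (the outcomes, the numbers, and the value_of_money helper) is invented purely for illustration, not anyone's actual valuation of money:

```python
# Illustrative only: all outcomes, numbers, and names here are invented.

# Terminal values: outcomes valued for their own sake.
terminal_utility = {
    "good_meal": 5.0,
    "travel_to_see_friends": 60.0,
}

def value_of_money(amount, prices):
    """Money's value here is purely instrumental: it is just the
    terminal utility of whatever the money can currently buy.
    A pile of money that buys nothing is worth zero."""
    affordable = [outcome for outcome, cost in prices.items() if cost <= amount]
    return sum(terminal_utility[outcome] for outcome in affordable)

prices = {"good_meal": 20, "travel_to_see_friends": 800}  # outcome -> cost

print(value_of_money(100, prices))   # only "good_meal" is affordable -> 5.0
print(value_of_money(1000, prices))  # both are affordable -> 65.0
print(value_of_money(1000, {}))      # nothing left to buy -> 0.0
```

The point of the sketch is just that the money column never appears in the utility table itself; its worth is recomputed from what it leads to.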
What exactly would those “other considerations” be?
Something like me getting killed in the course of satisfying their utility functions, as you mentioned above, would be a big one.
Would you be comfortable being a part of putting somebody in jail for murdering your best friend (whoever that is)?
I support a system where we precommit to actions such as imprisoning people who commit crimes, in order to deter those crimes in the first place. My utility function doesn’t get positive value out of retribution against them. If an AI that hunts down and incarcerates murderers is better at preventing people from murdering in the first place, I would be in favor of it, assuming no unforeseen side effects.
Something like me getting killed in the course of satisfying their utility functions, as you mentioned above, would be a big one.
So basically your “utility function assigns value to the desires of beings whose values conflict with your own” unless they really conflict with your own (such as getting you killed in the process)?
I assign utility to their values even if they conflict with mine to such a great degree, but I have to measure that against the negative utility they impose on me.
I assign utility to their values even if they conflict with mine to such a great degree, but I have to measure that against the negative utility they impose on me.
So, as to the example, you would value that they want to kill you somewhat, but you would value not dying even more?
That’s my understanding of what I value, at least.
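In utility-function terms, that trade-off might look something like the toy sketch below; the weights, the my_utility helper, and the scenario numbers are all assumptions invented for this illustration, not a claim about anyone's real values:

```python
# Toy illustration of weighing others' preference-satisfaction against
# the harm their preferred outcome does to me. All numbers are made up.

EMPATHY_WEIGHT = 0.1   # how much I care about others getting what they want
SELF_WEIGHT = 1.0      # how much I care about outcomes for myself

def my_utility(others_satisfaction, harm_to_me):
    """Positive term for other beings' desires being satisfied, even
    when those desires conflict with mine, minus the negative utility
    their satisfaction imposes on me."""
    return EMPATHY_WEIGHT * others_satisfaction - SELF_WEIGHT * harm_to_me

# Scenario: someone would enjoy killing me (+10 satisfaction for them),
# but my dying is a very large harm to me (1000 on the harm scale).
print(my_utility(others_satisfaction=10, harm_to_me=1000))  # -999.0

# Scenario: their desire is satisfied without harming me
# (e.g. the non-sentient simulation suggested further down).
print(my_utility(others_satisfaction=10, harm_to_me=0))     # 1.0
```

The killer's desire contributes a small positive term either way; it just never comes close to outweighing the negative term for my own death.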
Well, I’m not so sure that those words (the ones that I used to summarize your position) even mean anything.
How could you value their wanting to kill you somewhat (which would be you feeling some desire while cycling through a few different instances of imagining them doing something that leads to your dying), but also value not dying even more (which would be you feeling even more desire while moving through a few different examples of imagining yourself being alive)?
It would be like saying that you value going to the store somewhat (which would be you feeling some desire while cycling through a few different instances of imagining yourself traveling to the store and getting there), but value not actually being at the store even more (which would be you feeling even more desire while moving through a few different examples of imagining not being at the store). But would that make sense? Do those words (the ones making up the first sentence of this paragraph) even mean anything? Or are they just nonsense?
Simply put, would it make sense to say that somebody could value X+Y (where the addition sign refers to adding the first event to the second in a sequence), but not Y (which is part of the X+Y sequence that the person apparently likes)?
As you already pointed out to TheOtherDave, we have multiple values which can conflict with each other. Maximally fulfilling one value can lead to low utility as it creates negative utility according to other values. I have a general desire to fulfill the utility functions of others, but sometimes this creates negative utility according to my other values.
Simply put, could you value X+Y (where the addition sign refers to adding the first event to the second in a sequence), but not Y?
Unless I’m misunderstanding you, yes. Y could have zero or negative utility, but the positive utility of X could be great enough that the combination of the two still has positive overall utility.
E.g. you could satisfy both values by helping build a (non-sentient) simulation through which they can satisfy their desire to kill you without actually killing you.
But really I think the problem is that when we refer to individual actions as if they’re terminal values, it’s difficult to compromise; true terminal values tend to be more personal than that.
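As a concrete illustration of how a sequence X+Y can come out positive even when Y alone is not, here is a minimal sketch; the events, the numbers, and the value_of_sequence helper are all invented for this example and assume utilities simply add:

```python
# Toy arithmetic for valuing a sequence X+Y while disvaluing Y on its own.
# All events and utilities are invented for illustration.

utilities = {
    "friend_is_rescued": 50.0,   # X: strongly valued outcome
    "attacker_is_jailed": -3.0,  # Y: mildly disvalued taken by itself
}

def value_of_sequence(events):
    """Assuming utilities over events are additive, the value of a
    sequence is just the sum of the values of its parts."""
    return sum(utilities[event] for event in events)

print(value_of_sequence(["attacker_is_jailed"]))                       # -3.0
print(value_of_sequence(["friend_is_rescued", "attacker_is_jailed"]))  # 47.0
# The whole sequence is valued even though one of its parts, taken
# alone, is not.
```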