Similarly, if our process of extrapolating human values has local stopping criteria, there’s no limit to how bad the resulting values could end up being, or how “far away” in the space of values they could go.
I feel like there’s a distinct difference between “human values could end up arbitrarily distant from current ones” and “human values could end up arbitrarily bad”.
That is, I feel that I have certain values that I care a lot about (for example, wanting there to be less intense suffering in the world) and which I wouldn’t want to change; and also other values for which I don’t care about how much they’d happen to drift.
If you think of this as my important values being points on dimensions x1-x10, and my non-important values as being points on dimensions x11-x100, then, assuming that the value space is infinite, my values on dimensions x11-x100 could drift arbitrarily far away from their current positions. So the distance between my future and current values could end up being arbitrarily large, but if my important values remained in their current positions, this would still not be arbitrarily bad, since the values that I actually care about would not have drifted.
Obviously this toy model is flawed, since I don’t think it actually makes sense to model values as being totally independent of each other, but maybe you get the intuition that I’m trying to point at anyway.
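Flaws aside, here’s a minimal sketch of the toy model in Python. It assumes a 100-dimensional value vector, plain Euclidean distance over all dimensions, and a “badness” measure that only looks at the ten important dimensions; those specifics, and the random drift model, are illustrative assumptions rather than anything argued for above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy value space: 100 dimensions. Dimensions 0-9 are the "important"
# values; dimensions 10-99 are the ones whose drift I don't care about.
current = rng.normal(size=100)

def distance(a, b):
    # Illustrative choice: plain Euclidean distance over all dimensions.
    return np.linalg.norm(a - b)

def badness(a, b):
    # Only drift in the important dimensions counts as "bad" here.
    return np.linalg.norm(a[:10] - b[:10])

for drift_scale in [1, 10, 1000]:
    future = current.copy()
    # Let the unimportant dimensions drift arbitrarily far...
    future[10:] += drift_scale * rng.normal(size=90)
    # ...while the important ones stay put.
    print(drift_scale,
          round(distance(current, future), 1),
          round(badness(current, future), 1))

# The total distance grows without bound as drift_scale grows,
# but the "badness" stays at 0 throughout.
```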
This would suggest that the problem is not that “values can get arbitrarily distant”, but rather something like “the meta-values that make us care about some of our values keeping their specific content may get violated”. (Of course, “values can get arbitrarily distant” can still be the problem if you have a meta-value that says they shouldn’t do that.)
>and also other values for which I don’t care about how much they’d happen to drift.
Hum. In what way can you be said to have these values then? Maybe these are un-endorsed preferences? Do you have a specific example?
Off the top of my head, right now I value things such as nature, literature, sex, democracy, the rule of law, the human species and so on, but if my descendants had none of those things and had replaced them with something totally different and utterly incomprehensible, that’d be fine with me as long as they were happy and didn’t suffer much.
If I said that some of these were instrumental preferences, and some of these were weak preferences, would that cover it all?
Some are instrumental, yes, though I guess that for “weak preferences”, it would be more accurate to say that I value some things for my own sake rather than for their sake. That is, I want to be able to experience them myself, but if others find them uninteresting and they vanish entirely after I’m gone, that’s cool.
(There has to be some existing standard term for this.)
That doesn’t sound complicated or mysterious at all—you value these for yourself, but not necessarily for everyone. So if other people lack these values, then that’s not far from your initial values, but if you lack them, then it is far.
This seems to remove the point of your initial answer?
Well, that depends on how you choose the similarity metric. Like, if you code “the distance between Kaj’s values and Stuart’s values” as the Jaccard distance between the two sets of values, then you could push the distance between our values arbitrarily close to its maximum just by adding values that I have but you don’t, or vice versa. So if you happened to lack a lot of my values, then our values would count as far apart.
Jaccard distance probably isn’t a great choice of metric for this purpose, but I don’t know what a good one would be.
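As a concrete illustration of why Jaccard distance behaves this way, here’s a small sketch that treats each person’s values as a bare set of labels; the labels are made up for the example.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |intersection| / |union|; ranges from 0 (identical) to 1 (disjoint)."""
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)

shared = {"less suffering", "happiness"}
kaj = shared | {"nature", "literature", "democracy"}
stuart = set(shared)

print(jaccard_distance(kaj, stuart))  # 0.6

# Adding more values that only one of us holds pushes the distance
# towards its maximum of 1, even though the shared values are untouched.
kaj |= {f"idiosyncratic value {i}" for i in range(1000)}
print(round(jaccard_distance(kaj, stuart), 3))  # ~0.998
```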
If we make the (false) assumption that we both have utility/reward functions, and define E_U(V) as the expected value of utility V if we assume a U-maximiser is choosing the actions, then we can measure the distance between utilities U and V as d(U,V) = E_U(U) - E_V(U).
This is non-symmetric and doesn’t obey the triangle inequality, but it is a very natural measure—it represents the cost to U to replace a U-maximiser with a V-maximiser.
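A minimal sketch of this measure, under the simplifying assumption that the “world” is a one-shot choice among a handful of outcomes, so the expectations collapse into point evaluations; the outcomes and utility numbers are made up for illustration.

```python
# d(U, V) = E_U(U) - E_V(U) in a toy deterministic setting: the "world"
# is a single choice among a few outcomes, so the expected utility under
# a W-maximiser is just the utility evaluated at W's favourite outcome.

# Hypothetical utility functions over made-up outcomes.
U = {"status quo": 0, "more nature": 2, "more literature": 1, "less suffering": 10}
V = {"status quo": 0, "more nature": 9, "more literature": 8, "less suffering": 7}

def best(w):
    # The outcome a maximiser of utility function w would choose.
    return max(w, key=w.get)

def d(u, v):
    # Cost to u of replacing a u-maximiser with a v-maximiser.
    return u[best(u)] - u[best(v)]

print(d(U, V))  # 8: U loses a lot when a V-maximiser takes over
print(d(V, U))  # 2: V loses much less the other way round, so d is not symmetric
print(d(U, U))  # 0: replacing a U-maximiser with itself costs nothing
```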
Equivalently, we can say that we don’t know how we should define the dimensions of the space of human values, or the distance measure from current human values, and that if we pick these definitions arbitrarily, we will end up with arbitrary results.