Hmm, I think in both those cases I would be inclined to say that “human values” better matches what people say they want, so maybe “values” isn’t a great name for this concept.
Nevertheless I think the divergence is often interesting and perhaps “morally significant”. Two examples:
The net effect of people’s actions may often be to facilitate their own self-reproduction and that of their kin, even though that might not be either what they say their ultimate values are or what they want on a day-to-day basis (of course, this happens because evolution has optimized their short-term values to ultimately lead to self-reproduction).
People can sometimes acquire values in the process of interacting with the world and other people. So for example, many Western countries have a value of religious tolerance. But we didn’t start off with that value—instead, it emerged as a solution to bitter religious conflict. At the time of the conflict, each side’s explicit values and desires were to crush their enemies and have their religion reign supreme (well, I’m sure it was much more complicated than that, but I’m going to pretend it wasn’t for the sake of the example). Or people can acquire values of toughness and resilience through coping with an extreme environment, but then continue to hold those values even when their environment becomes more comfortable.
Anyways, maybe these situations don’t necessitate introducing a new definition of value, but I think they capture some dynamics that are important and not totally evident from definitions 1 and 2 alone. Maybe one way of framing it is that the correct ‘extrapolation’ in definition #2 might not just include thinking more, but also interacting with other people and the world, sometimes in ways you might not initially endorse. Or maybe the ‘correct’ definition of our values might look something like “self-reproduction, together with the reproduction of the entire ecology in which your behavior makes sense” (similar to Quintin Pope’s answer). And it seems like AGI systems might also exhibit some of these dynamics, so analyzing them, and understanding exactly which features of our cognition lead to them, may be important.
I’m a little hesitant to look for highly specific definitions of “human values” at this stage. We seem fundamentally confused about the topic, and I worry that specific definitions generated while confused may guide our thinking in ways we don’t anticipate or want. I’ve kept my internal definition of value pretty vague, something like “the collection of cognitive processes that make a given possible future seem more or less desirable”.
I think that, if we ever de-confuse human values, we’ll find they’re more naturally divided along lines we wouldn’t have thought of in our currently confused state. I think hints of this emerge in my analysis of “values as mesa optimizers”.
If the brain simultaneously learns to maximize reward circuit activation AND to model the world around it, then those represent two different types of selection pressures applied to our neural circuitry. I think those two selection pressures give rise to two different types of values, which are separated from each other on an axis I’d have never considered before.
Tentatively, the “reward circuit activation” pressure seems to give rise to values that are more “maximalist” or “expansive” (we want there to be lots of happy people in the future). The “world modelling” pressure seems to give rise to values that are more “preserving” (we want the future to have room for more than just happiness).
These two types of values seem like they’re often in tension, and I could see reconciling between them as a major area of study for a true “theory of human values”.
(You can replace “happiness” with whatever distribution of emotions you think optimal, and some degree of tension still remains)
Definitely agreed that we shouldn’t try to obtain a highly specific definition of human values right now. And that we’ll likely find that better formulations lead to breaking down human values in ways we currently wouldn’t expect.