My first question would be “how do you define human values”? Here are two possible answers:
“Human values” means “everything a human wants / desires / prefers”.
“Human values” means “what the person would say if you asked them what their deepest values are”. (Maybe with additional complications like “…if they had time to reflect” or “…if they were sufficiently wise” or whatever.)
I think #2 is how most people use the term “values”, but I have heard at least a couple AI alignment researchers use definition #1, so I figure it’s worth checking.
I would say #1 is the easier question. #1 is asking a rather direct question about brain algorithms; whereas #2 involves (A) philosophy, for deciding what the “proper” definition / operationalization of “human values” is, and then (B) walking through that scenario / definition in light of #1.
As for #1, see my post series Intro to Brain-Like-AGI Safety. I think you’ll get most of what you’re looking for in posts #7 & #9. You might find that you need to go back and read the top (summary) section of some of the other posts to get the terminology and context.
[Is that “the best mechanistic account” of #1? Well, I’m a bit biased :) ]
For getting from #1 to #2, it depends on how we’re operationalizing “human values”, but if it’s “what the person describes as their values when asked”, then I would probably say various things along the lines of Lukas_Gloor’s comment.
In addition to #1 and #2, I’m interested in another definition: “human values” are “the properties of the states of the universe that humans tend to optimize towards”. Obviously this has a lot to do with definitions 1 and 2, and could be analyzed as an emergent consequence of 1 and 2 together with facts about how humans act in response to their desires and goals. Plus maybe a bit of sociology, since most large-scale human optimization of the universe depends on the collective action of groups.
Interesting! I have a couple follow-up questions.
Take a coordination problem, like overfishing. No individual fisher wants overfishing to happen, but each is trying to feed their family, and we wind up with the fish going extinct. Would you say that “human values” are to overfish to extinction, or to preserve the fishery?
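To make the tension concrete, here's a toy sketch of the overfishing dynamic (all numbers invented for illustration, not a real fisheries model): each fisher takes only what their family needs, which is sustainable for one fisher but collectively pushes the stock past its regrowth threshold.

```python
def simulate_fishery(num_fishers, stock=800.0, growth=0.2, need=15.0, years=60):
    """Each year every fisher takes what their family needs (if the fish
    are there), then the surviving stock regrows by `growth`.
    Returns the stock level after each year."""
    history = []
    for _ in range(years):
        catch = min(num_fishers * need, stock)  # everyone just feeds their family
        stock = (stock - catch) * (1 + growth)
        history.append(stock)
        if stock < 1.0:  # fishery has collapsed
            break
    return history

alone = simulate_fishery(num_fishers=1)     # one fisher: stock keeps growing
village = simulate_fishery(num_fishers=10)  # ten fishers: stock collapses
print(f"one fisher, final stock: {alone[-1]:.0f}")
print(f"ten fishers, stock after {len(village)} years: {village[-1]:.0f}")
```

No individual's harvest is ruinous, yet the aggregate trajectory is extinction, so the "properties of the universe humans tend to optimize towards" diverge from what any participant wants.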
Let’s say that my brother and I are both addicted to cigarettes, and both want to quit. I eventually quit successfully, but my brother fails whenever he tries, and continues to smoke for the rest of his life. Would you say that my brother and I have similar “values”, or opposite “values”, concerning cigarette-smoking?
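The brothers case can be rendered as a tiny Monte Carlo (all parameters invented): both agents hold the identical stated value ("quit"), and differ only in how likely each quit attempt is to stick, yet the futures they actually steer toward look opposite.

```python
import random

def years_smoking(p_stick, max_years=40, rng=random):
    """Years spent smoking, given that each year's quit attempt
    sticks with probability p_stick."""
    years = 0
    while years < max_years:
        years += 1
        if rng.random() < p_stick:  # this attempt finally sticks
            break
    return years

rng = random.Random(0)
trials = 2000
# Same stated value ("I want to quit"), different success probability:
me = sum(years_smoking(0.5, rng=rng) for _ in range(trials)) / trials
brother = sum(years_smoking(0.02, rng=rng) for _ in range(trials)) / trials
print(f"average years smoked -- me: {me:.1f}, brother: {brother:.1f}")
```

Under definitions #1 and #2 the two agents are identical; under the "what the universe ends up optimized towards" definition they come apart.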
Hmm, I think in both those cases I would be inclined to say that “human values” better matches what people say they want, so maybe “values” isn’t a great name for this concept.
Nevertheless I think the divergence is often interesting and perhaps “morally significant”. Two examples:
The net effect of people’s actions may often be to facilitate their own self-reproduction and that of their kin, even though that might not be what they say their ultimate values are, or what they want on a day-to-day basis. (Of course, this happens because evolution has optimized their short-term values to ultimately lead to self-reproduction.)
People can sometimes acquire values in the process of interacting with the world and other people. For example, many Western countries have a value of religious tolerance. But we didn’t start off with that value; instead, it emerged as a solution to bitter religious conflict. At the time of the conflict, each side’s explicit values and desires were to crush their enemies and have their religion reign supreme. (I’m sure it was much more complicated than that, but I’m going to pretend it wasn’t for the sake of the example.) Or people can acquire values of toughness and resilience through coping with an extreme environment, but then continue to hold those values even when their environment becomes more comfortable.
Anyways, maybe these situations don’t necessitate introducing a new definition of value, but I think they capture some dynamics that are important and not totally evident from definitions #1 and #2 alone. One way of framing it is that the correct ‘extrapolation’ in definition #2 might not just involve thinking more, but also interacting with other people and the world, sometimes in ways you might not initially endorse. Or maybe the ‘correct’ definition of our values might look something like “self-reproduction, together with the reproduction of the entire ecology in which your behavior makes sense” (similar to Quintin Pope’s answer). And it seems like AGI systems might also exhibit some of these dynamics, so analyzing them, and understanding exactly which features of our cognition lead to them, may be important.
I’m a little hesitant to look for highly specific definitions of “human values” at this stage. We seem fundamentally confused about the topic, and I worry that specific definitions generated while confused may guide our thinking in ways we don’t anticipate or want. I’ve kept my internal definition of value pretty vague, something like “the collection of cognitive processes that make a given possible future seem more or less desirable”.
I think that, if we ever de-confuse human values, we’ll find they’re more naturally divided along lines we wouldn’t have thought of in our currently confused state. I think hints of this emerge in my analysis of “values as mesa optimizers”.
If the brain simultaneously learns to maximize reward circuit activation AND to model the world around it, then those represent two different types of selection pressures applied to our neural circuitry. I think those two selection pressures give rise to two different types of values, which are separated from each other on an axis I’d have never considered before.
Tentatively, the “reward circuit activation” pressure seems to give rise to values that are more “maximalist” or “expansive” (we want there to be lots of happy people in the future). The “world modelling” pressure seems to give rise to values that are more “preserving” (we want the future to have room for more than just happiness).
These two types of values seem like they’re often in tension, and I could see reconciling them as a major area of study for a true “theory of human values”.
(You can replace “happiness” with whatever distribution of emotions you think optimal, and some degree of tension still remains.)
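Here's a deliberately cartoonish way to picture that tension (my toy framing, not a claim about real brains or about the mesa-optimizer analysis): represent the policy as a single scalar, with the reward pressure as a constant "more" gradient and the world-model pressure as a pull back toward familiar states, and watch where they settle.

```python
# Two pressures as two gradient terms acting on one scalar "policy" theta:
#   reward pressure: always push for more (expansive / maximalist)
#   world-model pressure: pull back toward familiar states (preserving)
theta = 0.0
familiar = 0.0   # what the world model has seen so far
lr = 0.1
for _ in range(200):
    grad_reward = 1.0                 # "more, always more"
    grad_model = -(theta - familiar)  # "stay near what makes sense"
    theta += lr * (grad_reward + 2.0 * grad_model)
print(f"theta settles at {theta:.3f}")
```

Neither pressure wins outright; the system equilibrates at a compromise, which is one very crude picture of what "reconciling" the two value types could look like.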
Definitely agreed that we shouldn’t try to obtain a highly specific definition of human values right now. And that we’ll likely find that better formulations lead to breaking down human values in ways we currently wouldn’t expect.