I’m a little hesitant to look for highly specific definitions of “human values” at this stage. We seem fundamentally confused about the topic, and I worry that specific definitions generated while confused may guide our thinking in ways we don’t anticipate or want. I’ve kept my internal definition of value pretty vague, something like “the collection of cognitive processes that make a given possible future seem more or less desirable”.
I think that, if we ever de-confuse human values, we’ll find they’re more naturally divided along lines we wouldn’t have thought of in our currently confused state. I think hints of this emerge in my analysis of “values as mesa optimizers”.
If the brain simultaneously learns to maximize reward circuit activation AND to model the world around it, then those represent two different types of selection pressures applied to our neural circuitry. I think those two selection pressures give rise to two different types of values, which are separated from each other on an axis I’d have never considered before.
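To make the "two selection pressures on the same circuitry" picture a bit more concrete, here's a toy numerical sketch. It's purely illustrative: the functions, numbers, and names are my assumptions, not a model of actual neural learning. A single parameter is pushed simultaneously by a reward-maximizing gradient and a world-model-error-minimizing gradient, and where it settles depends on how the two pressures are weighted.

```python
import numpy as np

# Toy illustration (all functional forms and numbers are made up):
# one scalar "circuit parameter" theta is shaped by two pressures at
# once -- one rewarding it for driving reward-circuit activation, one
# rewarding it for keeping world-model prediction error low. The two
# pressures pull toward different optima, so the equilibrium depends
# on how they're weighted.

def d_reward(theta):
    # "Reward circuit activation" pressure: reward peaks at theta = 3.
    return -2.0 * (theta - 3.0)

def d_prediction_error(theta):
    # "World modelling" pressure: prediction error is lowest at theta = 0.
    return 2.0 * theta

def equilibrium(weight_world_model, steps=5000, lr=0.01):
    """Follow both gradients at once and return where theta settles."""
    theta = 0.0
    for _ in range(steps):
        grad = d_reward(theta) - weight_world_model * d_prediction_error(theta)
        theta += lr * grad
    return theta

for w in [0.0, 0.5, 1.0, 4.0]:
    print(f"world-model weight {w:>3}: theta settles near {equilibrium(w):.2f}")
```

The only point is that a single set of parameters shaped by two objectives generally can't fully satisfy either; the "maximalist" and "preserving" pulls I describe below are the value-level analogue of that.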
Tentatively, the “reward circuit activation” pressure seems to give rise to values that are more “maximalist” or “expansive” (we want there to be lots of happy people in the future). The “world modelling” pressure seems to give rise to values that are more “preserving” (we want the future to have room for more than just happiness).
These two types of values seem like they’re often in tension, and I could see reconciling them becoming a major area of study for a true “theory of human values”.
(You can replace “happiness” with whatever distribution of emotions you think is optimal, and some degree of tension still remains.)
Definitely agreed that we shouldn’t try to obtain a highly specific definition of human values right now, and that we’ll likely find that better formulations break human values down in ways we currently wouldn’t expect.