I disagree with how this post uses the word “values” throughout, rather than “desires” (or “preferences”), which (AFAICT) would better match how the term is being used here.
This has definitely been a point of confusion. There are a few ways one might reasonably interpret the phrase “human values”:
the common denominator of what all humans ever have cared about
the ethical consensus of (some subset of) humanity
the injunctions a particular human would verbally endorse
the cognitive artifacts inside each particular human that implement that human’s valuing-of-X, including the cases where they verbally endorse that valuing (along with a bunch of other kinds of preferences, both wanted and unwanted)
I think the shard theory workstream generally uses “human values” in the last sense.
I view values as action-determinants, the moment-to-moment internal tugs which steer my thinking and action. I cash out “values”/”subshards” as “contextually activated computations which are shaped into existence by past reinforcement, and which often steer towards their historical reinforcers.”
This is a considerably wider definition of “human values” than is usually considered. For example, I might have a narrowly activated value against taking COVID tests, because past reinforcement slapped down my decision to do so after I tested positive and realized I’d be isolating all alone. (The testing was part of why I had to isolate, after all, so I infer that my credit assignment tagged that decision for down-weighting.)
This unusual definitional broadness is a real cost, which is why I like talking about an anti-COVID-test subshard, instead of an anti-test “value.”
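To make the “contextually activated computation” framing concrete, here is a minimal toy sketch in Python. It is not from any shard theory codebase; every name and number in it is hypothetical, invented purely for illustration. A subshard fires only when its trigger context matches, active subshards’ “tugs” are summed to pick an action, and a crude credit-assignment step adjusts the weight of whichever subshards steered toward the reinforced decision:

```python
# Toy illustration only: a "subshard" as a contextually activated
# computation whose steering strength is shaped by past reinforcement.
# All names here are hypothetical, not from any shard theory codebase.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subshard:
    name: str
    trigger: Callable[[dict], bool]   # fires only in matching contexts
    action: str                       # the behavior it tugs toward
    weight: float                     # steering strength, shaped by reinforcement

def choose(context: dict, subshards: list[Subshard]) -> str:
    """Sum the 'tugs' of every subshard active in this context."""
    tugs: dict[str, float] = {}
    for s in subshards:
        if s.trigger(context):
            tugs[s.action] = tugs.get(s.action, 0.0) + s.weight
    return max(tugs, key=tugs.get) if tugs else "default"

def reinforce(reward: float, context: dict, subshards: list[Subshard],
              chosen: str, lr: float = 0.5) -> None:
    """Crude credit assignment: re-weight every active subshard
    that tugged toward the chosen action."""
    for s in subshards:
        if s.trigger(context) and s.action == chosen:
            s.weight += lr * reward

# The COVID-test anecdote: after testing led to lonely isolation
# (negative reinforcement), the take-test tug gets down-weighted,
# leaving a contextually activated aversion to testing.
shards = [Subshard("take-test", lambda c: c.get("symptoms"), "test", 1.0),
          Subshard("avoid-isolation", lambda c: c.get("symptoms"), "skip-test", 0.3)]
ctx = {"symptoms": True}
first = choose(ctx, shards)              # -> "test"
reinforce(-2.0, ctx, shards, first)      # isolation was unpleasant
print(first, "->", choose(ctx, shards))  # now -> "skip-test"
```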
Hmm, I think I’d say “I might feel an aversion…” there.
“Desires & aversions” would work in a context where the sign was ambiguous. So would “preferences”.