80% credence: It’s very hard to train an inner agent which reflectively equilibrates to an EU maximizer only over commonly postulated motivating quantities (like # of diamonds, # of happy people, or reward-signal) and not over quantities like # of times I have to look at a cube in a blue room or -1 * subjective micromorts accrued.
Intuitions:
I expect contextually activated heuristics to be the default, and that agents will learn lots of such contextual values which don’t cash out to being strictly about diamonds or people, even if the overall agent is mostly motivated in terms of diamonds or people.
Agents might also “terminalize” instrumental subgoals by caching computations (e.g. cache the heuristic that dying is bad, without recalculating from first principles for every plan in which you might die).
Therefore, I expect this value-spread to be convergently hard to avoid.
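The caching intuition can be sketched as a toy program (everything here, including the function names and the payoff value, is a hypothetical illustration, not anything from the source): an agent derives "surviving is instrumentally valuable" once from its goal, caches the result, and thereafter consults the cached heuristic rather than re-deriving it for each plan.

```python
from functools import lru_cache

# Count how often the expensive first-principles derivation actually runs.
FIRST_PRINCIPLES_CALLS = 0

def value_of_surviving_from_first_principles(goal_value: float) -> float:
    """Expensive recomputation: surviving is valuable because it lets the
    agent keep pursuing its goal (survival inherits the goal's value)."""
    global FIRST_PRINCIPLES_CALLS
    FIRST_PRINCIPLES_CALLS += 1
    return goal_value

@lru_cache(maxsize=None)
def cached_survival_heuristic() -> float:
    # Computed once and reused forever after: the heuristic "dying is bad"
    # is now detached from the derivation that originally justified it.
    return value_of_surviving_from_first_principles(goal_value=10.0)

# Evaluate many plans in which the agent might die; the derivation runs once.
for _ in range(100):
    plan_penalty = -cached_survival_heuristic()
```

In this sketch, the cached heuristic keeps firing even if the underlying goal later changes, which is one way an instrumental subgoal can behave as if it were terminal.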
I think that shards will cast contextual shadows into the factors of a person’s equilibrated utility function, because I think the shards are contextually activated to begin with. For example, if a person hates doing jumping jacks in front of a group of her peers, then that part of herself can bargain to penalize jumping jacks just in those contexts in the final utility function. Compared to a blanket “no jumping jacks ever” rule, this trade is less costly to other shards and allows more efficient trades to occur.
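The bargaining point above can be made concrete with a toy model (all shard names, payoffs, and plans are hypothetical illustrations, not from the source): a utility function is assembled by summing shard bids over (activity, context) plans, and a contextual penalty costs the other shards less than a blanket rule.

```python
# Plans are (activity, context) pairs the equilibrated agent can choose among.
plans = [
    ("jumping_jacks", "alone"),
    ("jumping_jacks", "in_front_of_peers"),
    ("resting", "alone"),
]

# A fitness shard values exercise in every context.
def fitness_shard(activity, context):
    return 1.0 if activity == "jumping_jacks" else 0.0

# Contextual bargain: penalize jumping jacks only in front of peers.
def contextual_embarrassment_shard(activity, context):
    if activity == "jumping_jacks" and context == "in_front_of_peers":
        return -2.0
    return 0.0

# Blanket rule: penalize jumping jacks in every context.
def blanket_embarrassment_shard(activity, context):
    return -2.0 if activity == "jumping_jacks" else 0.0

def utility(plan, shards):
    # The "final utility function" is the sum of the shards' bids.
    return sum(shard(*plan) for shard in shards)

contextual = [fitness_shard, contextual_embarrassment_shard]
blanket = [fitness_shard, blanket_embarrassment_shard]

best_contextual = max(plans, key=lambda p: utility(p, contextual))
best_blanket = max(plans, key=lambda p: utility(p, blanket))
```

Under the contextual bargain the fitness shard still gets to exercise alone, while the blanket rule forecloses that option too, which is why the narrower trade is cheaper for the other shards.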