Weirdness that comes from reflection:
In this frame, I can self-reflect on a given circuit and ask, “Does this circuit actually push me towards what I think is good?” When doing this, I’ll be using some more meta/higher-order circuits (concepts I’ve built up over time about what a “good” brain looks like), but I’ll also be using lower-level circuits, and I might even end up using the evaluated circuit itself in this evaluation process.
Sometimes this reflection process will go smoothly. Sometimes it won’t. But one takeaway/claim is that you have this complex, roundabout process for re-evaluating your values when some circuits begin to think that other circuits have diverged from “good”.
Because of this ability to reflect and change, it seems correct to say that “I value things conditional on my environment” (where “environment” has a lot of flex: it could be as small as your workspace or as broad as “any existing human culture”).
Example. Let’s say there was literally no scarcity of survival goods (food, water, etc.). It seems like a HUGE chunk of my values and morals are built-up inferences and solutions to resource allocation problems. If resource scarcity were magically no longer a problem, many of my values would lose their connection to reality. From what I’ve seen so far of my own self-reflection process, it seems likely that over time I would come to reorganize my values in such a post-scarcity world. I’ve also currently got no clue what that reorganization would look like.
AFI worry: A human-in-the-loop AI that only takes actions that get human approval (and whose expected outcomes have human approval) hits big problems when the context it is acting in is very different from the context in which our values were trained.
Is there any way around this besides simulating people having their values re-organized given the new environment? Is this what CEV is about?