Current beliefs about how human values work: various thoughts and actions can produce a “reward” signal in the brain. I also have lots of predictive circuits that fire when they anticipate that a “reward” signal is coming as a result of what just happened. These predictive circuits have been trained on the patterns of my environment to predict when the “reward” signal is coming.
Getting an “actual reward” and having a predictive circuit fire are both experienced as something “good”. Because of this, predictive circuits can track not only “actual reward” but also the activation of other predictive circuits. (So far this is basically “there are terminal and instrumental values, and they are experienced as roughly the same thing”.)
The predictive circuits are all running some “learning process” to keep their firing correlated with what they’re tracking. However, the “quality” of this learning can vary drastically. Some circuits are more “hardwired” than others, and less able to update when they begin to become uncorrelated with what they are tracking. Some are caught in interesting feedback loops with other circuits, such that you have to update multiple circuits simultaneously, or in a particular order.
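To make this cartoon concrete for myself, here’s a minimal toy sketch of a “predictive circuit” as a delta-rule learner. This isn’t a claim about actual neural machinery; the class name, the learning rates, and the numbers are all made up, and “hardwired” is modeled as nothing more than a learning rate near zero.

```python
import random

class PredictiveCircuit:
    """Toy circuit: predicts how much reward follows a cue, and updates toward what it sees."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate  # near 0 ~= "hardwired": very slow to update
        self.prediction = 0.0

    def fire(self):
        # The circuit's activation: its current estimate of the incoming reward.
        return self.prediction

    def update(self, observed_reward):
        # Delta-rule learning: nudge the prediction toward what actually happened.
        error = observed_reward - self.prediction
        self.prediction += self.learning_rate * error

random.seed(0)
flexible = PredictiveCircuit(learning_rate=0.3)    # updates readily
hardwired = PredictiveCircuit(learning_rate=0.01)  # barely updates

# An environment where the cue reliably precedes a reward of about 1.0:
for _ in range(200):
    reward = 1.0 + random.gauss(0, 0.1)
    flexible.update(reward)
    hardwired.update(reward)

print(flexible.fire(), hardwired.fire())  # flexible is ~1.0; hardwired still lags well behind
```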
Though everything that feels “good” feels good because at some point or another it was tracking the base “reward” signal, it won’t always be a good idea to think of the “reward” signal as the thing you value.
Say you have a circuit that tracks a proxy of your base “reward”. If something happens in your brain such that this circuit ceases to update, you basically value this proxy terminally.
Said another way, I don’t have a nice clean ontological line between terminal values and instrumental values. The less able a predictive circuit is to update, the more “terminal” the value it represents.
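A tiny illustration of that last claim, using the same made-up delta-rule model as above: a circuit learns to track a proxy that pays off, then its learning rate drops to zero, and from that point on it keeps “valuing” the proxy no matter what the proxy actually delivers.

```python
proxy_payoff = 1.0   # how much reward the proxy used to deliver
prediction = 0.0     # the circuit's learned value for the proxy
learning_rate = 0.3

# Phase 1: the proxy reliably pays off, so the circuit learns to fire for it.
for _ in range(100):
    prediction += learning_rate * (proxy_payoff - prediction)

# Phase 2: the circuit "ceases to update" and the proxy stops paying off.
learning_rate = 0.0
proxy_payoff = 0.0
for _ in range(100):
    prediction += learning_rate * (proxy_payoff - prediction)

print(round(prediction, 2))  # still ~1.0: the proxy is now valued as if it were terminal
```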
Weirdness that comes from reflection:
In this frame, I can self-reflect on a given circuit and ask, “Does this circuit actually push me towards what I think is good?” When doing this, I’ll be using some more meta/higher-order circuits (concepts I’ve built up over time about what a “good” brain looks like), but I’ll also be using lower-level circuits, and I might even end up using the evaluated circuit itself in this evaluation process.
Sometimes this reflection process will go smoothly. Sometimes it won’t. But one takeaway/claim is that you have this complex, roundabout process for re-evaluating your values when some circuits begin to think that other circuits have diverged from “good”.
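Here’s a hedged sketch of what one step of that re-evaluation might look like inside the toy model: a made-up “meta” check that asks whether a circuit’s recent firing still lines up with the signal I currently endorse as “good”. The correlation measure and the threshold are arbitrary illustration choices, not a claim about how brains actually do this.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

def has_diverged(circuit_firing, endorsed_good, threshold=0.2):
    """Flag a circuit whose activations no longer track the endorsed 'good' signal."""
    return correlation(circuit_firing, endorsed_good) < threshold

# A circuit that used to agree with "good" but drifted in the second half of its history:
endorsed_good  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
circuit_firing = [1, 1, 0, 1, 0, 0, 0, 1, 0, 1]

print(has_diverged(circuit_firing, endorsed_good))  # True: time to re-evaluate this circuit
```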
Because of this ability to reflect and change, it seems correct to say that “I value things conditional on my environment” (where “environment” has a lot of flex; it could be as narrow as your workspace, or as broad as “any existing human culture”).
Example: let’s say there was literally no scarcity of survival goods (food, water, etc.). It seems like a HUGE chunk of my values and morals are built-up inferences and solutions to resource-allocation problems. If resource scarcity were magically no longer a problem, much of my value system would have lost its connection to reality. From what I’ve seen so far of my own self-reflection process, it seems likely that over time I would come to reorganize my values in such a post-scarcity world. I’ve also currently got no clue what that reorganization would look like.
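To gesture at that more concretely in the same toy terms: here’s a hypothetical circuit whose value for “stockpile food” was trained under a scarcity environment, and which keeps firing after the environment changes, even though the action no longer produces any reward. The environments and the reward rule are, obviously, invented for the sketch.

```python
def reward(action, env):
    # Invented reward rule: stockpiling only pays off while goods are scarce.
    if action == "stockpile food" and env == "scarcity":
        return 1.0
    return 0.0

learning_rate = 0.1
prediction = 0.0  # the circuit's learned value for "stockpile food"

# Values get trained under scarcity...
for _ in range(200):
    prediction += learning_rate * (reward("stockpile food", "scarcity") - prediction)
print(round(prediction, 2))  # ~1.0: stockpiling has come to feel "good"

# ...then the environment changes. Until the circuit (slowly) re-trains, what feels
# good has lost its connection to what actually produces reward.
print(reward("stockpile food", "post-scarcity"))  # 0.0, while the prediction is still ~1.0
```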
FAI worry: a human-in-the-loop AI that only takes actions that get human approval (and whose expected outcomes have human approval) hits big problems when the context the AI is acting in is very different from the context our values were trained in.
Is there any way around this besides simulating people having their values reorganized given the new environment? Is this what CEV is about?