But humans are capable of thinking about what their values “actually should be”, including whether or not they should be the values evolution selected for (either alone or in addition to other things). We’re also capable of thinking about whether things like wireheading are actually good to do, even after trying it for a bit.
We don’t simply commit to tricking our reward systems forever and only doing that, for example.
So that overall suggests a level of coherence and consistency in the “coherent extrapolated volition” sense. Evolution enabled CEV without us becoming completely orthogonal to evolution, for example.
A few points here:
We don’t have the option to “trick our reward systems forever”—e.g. because becoming a heroin addict tends to be self-destructive. If [guaranteed 80-year continuous heroin high followed by painless death] were an option, many people would take it (though not all).
The divergence between stated preferences and revealed preferences is exactly what we’d expect to see in worlds where we’re constantly “tricking our reward system” in small ways: our revealed preferences are not what we think they “actually should be”.
We tend to define the large ways of tricking our reward systems as those that are highly self-destructive. It’s not surprising that we observe few of these, since evolution tends to frown upon highly self-destructive behaviour.
Again, I’d ask for an example of a world plausibly reachable through an evolutionary process where we don’t have the kind of coherence and consistency you’re talking about.
Being completely orthogonal to evolution clearly isn’t plausible, since we wouldn’t be here (I note that when I don’t care about x, I sacrifice x to get what I do care about—I don’t take actions that are neutral with respect to x).
Being not-entirely-in-line with evolution, and not-entirely-in-line with our stated preferences, is exactly what we observe.