Values Are Real Like Harry Potter

Imagine a TV showing a video of a bizarre, unfamiliar object—let’s call it a squirgle. The video was computer-generated by a one-off piece of code, so there’s no “real squirgle” somewhere else in the world which the video is showing. Nonetheless, there’s still some substantive sense in which the squirgle on screen is “a thing”—even though I can only ever see it through the screen, I can still:

  • Notice that the same squirgle is shown at one time and another time.

  • Predict that a shiny spot on the squirgle will still be there later.

  • Discover new things about the squirgle, like a scratch on it.

… and so forth. The squirgle is still “a thing” about which I can have beliefs and learn things. Its “thingness” stems from the internal consistency/​compressibility of what the TV is showing.

[Image: AI imagines a squirgle.]

Similarly, Harry Potter is “a thing”. Like the squirgle, Harry Potter is fictional; there’s no “actual” Harry Potter “out in the real world”, just like the TV doesn’t show any “actual” squirgle “out in the real world”.[1] Nonetheless, I can know things about Harry Potter, the things I know about Harry Potter can have predictive power, and I can discover new things about Harry Potter.

So what does it mean for Harry (or the squirgle) to be “fictional”? Well, it means we can only ever “see” Harry through metaphorical TV screens—be it words on a page, or literal screens.[2]

Claim: human values are “fictional” in that same sense, just like the squirgle or Harry Potter. They’re still “a thing”, we can learn about our values and have beliefs about our values and so forth, but we can only “see” them by looking through a metaphorical screen; they don’t represent some physical thing “out in the real world”. The screen through which we can “see” our values is the reward signals received by our brain.

Background: Value Reinforcement Learning

In a previous post, we presented a puzzle:

  • We humans sure do seem to have beliefs about our own values and learn about our own values, in a pretty ordinary epistemic way…

  • But that implies some kind of evidence has to cross the is-ought gap.

We proposed that the puzzle is resolved by our brains treating reward signals as evidence about our own values, and then trying to learn about our values from that reward signal via roughly ordinary epistemic reasoning. The key distinction from standard reinforcement learning is that we have ordinary internal symbolic representations of our values, and beliefs about our values. As one particular consequence, that feature allows us to avoid wireheading.

(In fact, it turns out that Marcus Hutter and Tom Everitt proposed an idealized version of this model for Solomonoff-style minds under the name “Value Reinforcement Learning”. They introduced it mainly as a way to avoid wireheading in an AIXI-like mind.)
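
To make the model a bit more concrete, here’s a minimal toy sketch in Python. It is not Hutter and Everitt’s formal setup; the latent parameter theta, the two outcomes, the Gaussian noise, and the grid of hypotheses are all illustrative assumptions. The point is just the shape of the move: reward signals are treated as evidence, and beliefs-about-values get updated by ordinary Bayesian conditioning.

```python
import numpy as np

# Minimal toy sketch (NOT Hutter & Everitt's formal model): reward signals
# are treated as noisy *evidence* about a latent "values" parameter theta,
# and the agent maintains an ordinary Bayesian belief about its own values.
# The two outcomes, Gaussian noise, and grid of hypotheses are illustrative
# assumptions, not claims about brains.

rng = np.random.default_rng(0)

true_theta = 0.7                       # the agent's actual values (latent)
theta_grid = np.linspace(0, 1, 201)    # hypotheses about one's own values
posterior = np.full_like(theta_grid, 1 / len(theta_grid))  # uniform prior

def reward_likelihood(reward, outcome, theta, noise=0.2):
    """P(reward | outcome, values=theta): reward is a noisy readout of how
    much an agent with values theta would value the outcome that occurred."""
    mean = theta if outcome == "A" else 1.0 - theta
    return np.exp(-0.5 * ((reward - mean) / noise) ** 2)

# Experience a stream of (outcome, reward) pairs; update beliefs-about-values
# by ordinary Bayesian conditioning, exactly like any other epistemic update.
for _ in range(50):
    outcome = rng.choice(["A", "B"])
    mean = true_theta if outcome == "A" else 1.0 - true_theta
    reward = mean + rng.normal(0, 0.2)
    posterior *= reward_likelihood(reward, outcome, theta_grid)
    posterior /= posterior.sum()

print(f"posterior mode for theta: {theta_grid[np.argmax(posterior)]:.2f} "
      f"(true value: {true_theta})")
```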

But this model leaves some puzzling conceptual questions about the “values” of value-reinforcement-learners. In what sense, if any, are those values “real”? How do we distinguish between, say, a change in our values and a change in our beliefs about our values, or between “reliable” and “hacked” signals from our reward stream?

The tight analogy to other fictional “things”, like the squirgle or Harry Potter, helps answer those sorts of questions.

In What Sense Are Values “Real” Or “A Thing”?

The squirgle is “a thing” to exactly the extent that the images on the TV can be compactly represented as many different images of a single object. If someone came along and said “the squirgle isn’t even a thing, why are you using this concept at all?” I could respond “well, you’re going to have a much tougher time accurately predicting or compressing the images shown by that TV without at least an implicit concept equivalent to that squirgle”.

Likewise with Harry Potter. Harry Potter is “a thing” to exactly the extent that a whole bunch of books and movies and so forth consistently show text/​images/​etc which can be compactly represented as many different depictions of the same boy. If someone came along and said “Harry Potter isn’t even a thing, why are you using this concept at all?” I could respond “well, you’re going to have a much tougher time accurately predicting or compressing all this text/​images/​etc without at least an implicit concept equivalent to Harry Potter”.

Same with values. A human’s values—not the human’s estimate of their own values, not their revealed or stated preferences, but their actual values, the thing which their estimates-of-their-own-values are an estimate of—are “a thing” to exactly the extent that a whole bunch of the reward signals to that human’s brain can be compactly represented as generated by some consistent valuation. If someone came along and said “a human’s estimates of their own values aren’t an estimate of any actual thing, there’s no real thing there which the human is estimating” I could respond “well, you’re going to have a much tougher time accurately predicting or compressing all these reward signals without at least an implicit concept equivalent to this human’s values”. (Note that there’s a nontrivial empirical claim here: it could be that the human’s reward signals are not, in fact, well-compressed this way, in which case the skeptic would be entirely correct!)
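
To gesture at how that empirical claim could cash out, here’s a toy comparison, entirely my own construction: a reward stream is generated from a hypothetical consistent valuation over five situation types, and we compare its description length (negative log-likelihood, in bits) under “one consistent valuation” versus “no shared structure”.

```python
import numpy as np

# Toy illustration (my own construction) of the empirical claim: "values are
# a thing" to the extent that the reward stream is better compressed by
# positing one consistent valuation than by treating it as unstructured.
# The five situation types and the noise level are made-up assumptions.

rng = np.random.default_rng(1)

def description_length_bits(residuals):
    """Bits to encode residuals under a Gaussian code fit to those residuals
    (negative log-likelihood converted from nats to bits)."""
    sigma = residuals.std()
    nll = (0.5 * np.sum((residuals / sigma) ** 2)
           + len(residuals) * np.log(sigma * np.sqrt(2 * np.pi)))
    return nll / np.log(2)

# A hypothetical consistent valuation over five situation types generates the
# reward stream the agent actually observes.
true_valuation = np.array([0.9, 0.1, 0.5, 0.7, 0.3])
situations = rng.integers(0, 5, size=200)
rewards = true_valuation[situations] + rng.normal(0, 0.1, size=200)

# Model 1: one consistent valuation (one value per situation type).
fitted = np.array([rewards[situations == k].mean() for k in range(5)])
bits_valuation = description_length_bits(rewards - fitted[situations])

# Model 2: no shared structure, just the overall spread of rewards.
bits_unstructured = description_length_bits(rewards - rewards.mean())

print(f"bits assuming a consistent valuation: {bits_valuation:.0f}")
print(f"bits assuming no structure:           {bits_unstructured:.0f}")
# If the first number is not substantially smaller, the skeptic is right.
```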

How Do We Distinguish A Change In Values From A Change In Beliefs About Values?

Suppose I’m watching the video of the squirgle, and suddenly a different squirgle appears—an object which is clearly “of the same type”, but differs in the details. Or, imagine the squirgle gradually morphs into a different squirgle. Either way, I can see on the screen that the squirgle is changing. The screen consistently shows one squirgle earlier, and a different squirgle later. Then the images on the screen are well-compressed by saying “there was one squirgle earlier, and another squirgle later”. This is a change in the squirgle.

On the other hand, if the images keep showing the same squirgle over time, but at some point I notice a feathery patch that I hadn’t noticed before, then that’s a change in my beliefs about the squirgle. The images are not well-compressed by saying “there was one squirgle earlier, and another squirgle later”; I could go look at earlier images and see that the squirgle looked the same. It was my beliefs which changed.

Likewise for values and reward: if something physiologically changes my rewards on a long timescale, I may consistently see different values earlier vs later on that long timescale, and it makes sense to interpret that as values changing over time. Aging and pregnancy are classic examples: our bodies give us different reward signals as we grow older, and different reward signals when we have children. Those metaphorical screens show us different values, so it makes sense to treat that as a change in values, as opposed to a change in our beliefs about values.

On the other hand, I might think I value ‘power’ even if there are some externalities along the way, but then, when push comes to shove, I notice myself feeling a lot more squeamish about the idea of acquiring power by stepping on others than I expected to. I might realize that, on reflection, at every point I would actually have been quite squeamish about crushing people to get what I wanted. I was quantitatively wrong about my values; they didn’t change, my knowledge of them did. I do value ‘power’, but not at such cost to others.
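
One rough way to operationalize the distinction, sketched below with made-up numbers: compare how well the reward stream compresses under “one valuation throughout” versus “one valuation before some time t, a different one after”. A genuine value change shows up as the two-segment model winning by a wide margin; a mere belief change does not.

```python
import numpy as np

# Rough sketch with made-up numbers: a *value change* shows up as the reward
# stream compressing much better under "one valuation before t, another
# after" than under a single valuation; a mere *belief change* does not.

rng = np.random.default_rng(2)

def bits_one_valuation(rewards):
    """Bits to encode a reward segment as noise around a single valuation."""
    resid = rewards - rewards.mean()
    sigma = resid.std() + 1e-9
    nll = (0.5 * np.sum((resid / sigma) ** 2)
           + len(rewards) * np.log(sigma * np.sqrt(2 * np.pi)))
    return nll / np.log(2)

def bits_best_changepoint(rewards):
    """Best two-segment encoding over candidate changepoints."""
    return min(bits_one_valuation(rewards[:t]) + bits_one_valuation(rewards[t:])
               for t in range(20, len(rewards) - 20))

# Case 1: the valuation actually shifts partway through (e.g. a slow
# physiological change). Case 2: it is stable and only beliefs updated.
shifted = np.concatenate([rng.normal(0.3, 0.1, 100), rng.normal(0.8, 0.1, 100)])
stable = rng.normal(0.3, 0.1, 200)

for name, stream in [("shifted", shifted), ("stable", stable)]:
    print(f"{name}: one valuation {bits_one_valuation(stream):.0f} bits, "
          f"best changepoint {bits_best_changepoint(stream):.0f} bits")
# The two-valuation model wins big only for the shifted stream.
```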

How Do We Distinguish Reliable From Unreliable Reward Data?

Imagine that I’m watching the video of the squirgle, and suddenly the left half of the TV blue-screens. Then I’d probably think “ah, something messed up the TV, so it’s no longer showing me the squirgle” as opposed to “ah, half the squirgle just turned into a big blue square”. I know that big square chunks turning a solid color is a typical way for TVs to break, which largely explains away the observation; I think it much more likely that the blue half-screen came from some failure of the TV rather than an unprecedented behavior of the squirgle.

Likewise, if I see some funny data in my reward stream (like e.g. feeling a drug rush), I think “ah, something is messing with my reward stream” as opposed to “ah, my values just completely changed into something weirder/​different”. I know that something like a drug rush is a standard way for a reward stream to be “hacked” into showing a different thing; I think it much more likely that the rush is coming from drugs messing with my rewards than from new data about the same values as before.
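
That “explaining away” move can be written as a small Bayes-factor calculation, again with made-up numbers: the hacked-channel hypothesis starts with a low prior but predicts the weird burst precisely, while the “my values really are like that” hypothesis has to stretch the same valuation that generated all the ordinary rewards.

```python
import numpy as np

# Hedged sketch of the "explaining away" move, with made-up numbers. Two
# hypotheses compete over a sudden burst of huge rewards: H_hack (a known
# failure mode of the reward channel, e.g. a drug rush, is injecting it)
# versus H_values (the same valuation that has been producing ordinary
# rewards suddenly assigns enormous value to this).

def gaussian_loglik(x, mean, sigma):
    """Log-likelihood of observations x under a Gaussian(mean, sigma)."""
    return (-0.5 * np.sum(((x - mean) / sigma) ** 2)
            - len(x) * np.log(sigma * np.sqrt(2 * np.pi)))

rng = np.random.default_rng(3)

# Ordinary rewards sit around 0.5; the anomalous burst sits around 5.
burst = 5.0 + rng.normal(0, 0.1, size=10)

loglik_hack = gaussian_loglik(burst, mean=5.0, sigma=0.5)    # hacking predicts exactly this
loglik_values = gaussian_loglik(burst, mean=0.5, sigma=0.5)  # a stretch for the usual valuation

log_prior_odds = np.log(0.05 / 0.95)  # hacking is rare, but a well-understood failure mode
log_posterior_odds = log_prior_odds + (loglik_hack - loglik_values)
print(f"log posterior odds for 'reward stream hacked': {log_posterior_odds:.1f}")
# The likelihood ratio overwhelms the low prior, so the burst gets attributed
# to the channel being messed with, not to new data about the same old values.
```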

Thank you to Eli and Steve for their questions/​comments on the previous post, which provided much of the impetus for this post.

  1. ^

    You might think: “but there’s a real TV screen, or a real pattern in JK Rowling’s brain; aren’t those the real things out in the world?”. The key distinction is between symbol and referent—the symbols, like the pattern in JK Rowling’s brain or the lights on the TV or the words on a page, are “out in the real world”. But those symbols don’t have any referents out in the real world. There is still meaningfully “a thing” (or “things”) which the symbols represent, as evidenced by our beliefs about the “thing(s)” having predictive power for the symbols themselves, but the “thing(s)” the symbols represent isn’t out in the real world.

  2. ^

    Importantly, Harry’s illustrated thoughts and behavior, and the squirgle’s appearance over time, are well-compressed via an internally consistent causal model that is structurally much richer than the screen/​text itself, despite being “fictional.”