An analogy which might help: imagine a TV showing a video of an apple. The video was computer-generated by a one-time piece of code, so there’s no “real apple” somewhere else in the world which the video is showing. Nonetheless, there’s still some substantive sense in which the apple on screen is “a thing”—even though I can only ever see it through the screen, I can still look at that apple on the screen and e.g. predict that a spot on the apple will still be there later, I can still discover things about the apple, etc.
Values, on this post’s model, are like that apple, and our reward signals are like the TV.
So the values are “not real” in the sense that they don’t correspond to anything else in the world beyond the metaphorical TV (i.e. our rewards). But there can still be a substantive sense in which the values “displayed by” the rewards are “a thing”—there are still consistent patterns in the reward stream which can mostly be predictively well-modeled as “showing some values”, much like the TV is well-modeled as “showing an apple”.
Now, imagine that I’m watching the video of the apple, and suddenly the left half of the TV blue-screens. Then I’d probably think “ah, something messed up the TV, so it’s no longer showing me the apple” as opposed to “ah, half the apple just turned into a big blue square”. Likewise, if I see some funny data in my reward stream, I think “ah, something is messing with my reward stream” as opposed to “ah, my values just completely changed into something weirder”.
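A toy numerical sketch of that inference (every number below is invented purely for illustration; nothing here comes from the analogy itself):

```python
# Toy Bayesian comparison of two hypotheses for the anomalous half-blue frame:
# "the display glitched" vs. "the apple on screen really changed".
# All priors and likelihoods are made-up illustrative values.

p_glitch = 1e-2           # prior: displays do glitch now and then
p_apple_changed = 1e-6    # prior: apples essentially never turn half blue

p_obs_given_glitch = 0.5          # a glitch plausibly produces a half-blue frame
p_obs_given_apple_changed = 1.0   # a half-blue apple would certainly look like this

# Posterior odds via Bayes' rule (the shared normalizer cancels in the ratio).
posterior_odds = (p_glitch * p_obs_given_glitch) / (p_apple_changed * p_obs_given_apple_changed)
print(posterior_odds)  # ~5000:1 in favor of "something messed up the TV"
```

The same sort of odds ratio is implicitly behind reading funny reward data as “something is messing with my reward stream” rather than “my values just changed”.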
A similar analogy applies to values changing over time. If I’m watching the video of the apple, and suddenly a different apple appears, or if the apple gradually morphs into a different apple… well, I can see on the screen that the apple is changing. The screen consistently shows one apple at one time, and a different apple at a later time. Likewise for values and reward: if something physiologically changes my rewards on a long timescale, I may consistently see different values earlier vs later on that long timescale, and it makes sense to interpret that as values changing over time.
Did that help?
Yes! That analogy is helpful for communicating what you mean!
I still have issues with your thesis though.
I agree that this “explaining away” thing could be a reasonable way to think about, e.g., the situation where I get sick, and while I’m sick, some activity that I usually love (let’s say singing songs) feels meaningless. I probably shouldn’t conclude that “my values” changed, just that the machinery that implements my reward circuitry is being thrown off by my being sick.
On the other hand, I think I could just as well describe this situation as extending the domain over which I’m computing my values, e.g. “I love and value singing songs when I’m healthy, but when I’m sick in a particular way, I don’t love it. Singing-while-healthy is meaningful, not singing per se.”
In the same way, I could choose to call the blue screen phenomenon an error in the TV, or I could include that dynamic as part of the “predict what will happen with the screen” game. Since there’s no real apple that I’m trying to model, only an ephemeral image of the apple, there’s not a principled place to stand on whether to view the blue-screen as an error, or just part of the image generating process.
For any given fuckery with my reward signals, I could call it an error misrepresenting my “true values,” or I could embrace it as expressing a part of my “true values.” And if two people disagree about which conceptualization to go with, I don’t know how they could possibly resolve it. They’re both valid frames, fully consistent with the data. And they can’t get distinguishing evidence, even in principle.
(I think this is not an academic point. I think people disagree about values in this way reasonably often.
Is enjoying masturbating to porn an example of your reward system getting hacked by external super-stimuli, or is it just part of the expression of your true values? Both of these are valid ways to extrapolate from the reward data time series. Which things count as your reward system getting hacked, and which count as representing your values? It seems like a judgement call!
The classic and most fraught example is that some people find it drop dead obvious that they care about the external world, and not just their sense-impressions about the external world. They’re horrified by the thought of being put in an experience machine, even if their subjective experience would be way better.
Other people just don’t get this. “But your experience would be exactly the same as if the world was awesome. You wouldn’t be able to tell the difference”, they say. It’s obvious to them that they would prefer the experience machine, as long as their memory was wiped so they didn’t know they were in one.[1])
Talking about an epistemic process attempting to update your model of an underlying not-really-real-but-sorta structure seems to miss the degrees of freedom in the game. Since there’s no real apple, no one has any principled place to stand in claiming that “the apple really went half blue right there” vs. “no, the TV signal was just interrupted.” Any question about what the apple is “really doing” is a dangling node.[2]
As a separate point, while I agree the “explaining away disruptions” phenomenon is a thing that sometimes happens, I don’t think that’s usually what’s happening when a person reflects on their values. Rather, my guess is that it’s one of the three options that I suggested above.
Tangentially, this is why I expect that the CEV of humans diverges. I think some humans, on maximal reflection, wirehead, and others don’t.
Admittedly, I think the question of which extrapolation schema to use is itself decided by “your values”, which ultimately grounds out in the reward data. Some people have perhaps a stronger feeling of indignation about others hiding information from them, or perhaps a stronger sense of curiosity, or whatever, that crystallizes into a general desire to know what’s true. Other people have less of that. And so they have different responses to the experience-machine hypothetical.
Because which extrapolation procedure any given person decides to use is itself a function of “their values”, it all grounds out in the reward data eventually. Which perhaps defeats my point here.
I’d classify this as an ordinary epistemic phenomenon. It came up in this thread with Richard just a couple days ago.
Core idea: when plain old Bayesian world models contain latent variables, it is ordinary for those latent variables to have some irreducible uncertainty—i.e. we’d still have some uncertainty over them even after updating on the entire physical state of the world. The latents can still be predictively useful and meaningful, they’re just not fully determinable from data, even in principle.
Standard example (copied from the thread with Richard): the Boltzmann distribution for an ideal gas—not the assorted things people say about the Boltzmann distribution, but the actual math, interpreted as Bayesian probability. The model has one latent variable, the temperature T, and says that all the particle velocities are normally distributed with mean zero and variance proportional to T. Then, just following the ordinary Bayesian math: in order to estimate T from all the particle velocities, I start with some prior P[T], calculate P[T|velocities] using Bayes’ rule, and then for ~any reasonable prior I end up with a posterior distribution over T which is very tightly peaked around the average particle energy… but has nonzero spread. There’s small but nonzero uncertainty in T given all of the particle velocities. And in this simple toy gas model, those particles are the whole world, there’s nothing else to learn about which would further reduce my uncertainty in T.
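A quick numerical sketch of that calculation (two assumptions made here purely for concreteness, neither from the text above: the proportionality constant between variance and T is 1, and the prior over T is flat on a grid):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy gas: every particle velocity ~ Normal(0, T), with T the one latent variable.
T_true = 2.0
velocities = rng.normal(0.0, np.sqrt(T_true), size=1000)

# Grid of candidate temperatures, flat prior.
T_grid = np.linspace(1.5, 2.5, 2001)

# log P[velocities | T] for each candidate T.
log_lik = np.array([
    np.sum(-0.5 * np.log(2 * np.pi * T) - velocities**2 / (2 * T))
    for T in T_grid
])

# With a flat prior, the posterior is just the normalized likelihood on the grid.
post = np.exp(log_lik - log_lik.max())
post /= post.sum()

mean_T = np.sum(T_grid * post)
std_T = np.sqrt(np.sum((T_grid - mean_T) ** 2 * post))
print(mean_T, std_T)  # tightly peaked near mean(velocities**2), but std_T > 0
```

The nonzero std_T at the end is the point: the posterior spread doesn’t go away, because in this toy world there is nothing left to condition on beyond the velocities.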
Bringing it back to wireheading: first, the wireheader and the non-wireheader might just have different rewards; that’s not the conceptually interesting case, but it probably does happen. The interesting case is that the two people might have different value-estimates given basically-similar rewards, and that difference cannot be resolved by data because (like temperature in the above example) the values-latent is underdetermined by the data. In that case, the difference would be in the two people’s priors, which would be physiologically embedded somehow.
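One way to sketch how that plays out (hypothetical numbers throughout): if two value-hypotheses assign the same likelihood to every possible reward observation, then Bayes’ rule leaves each person’s posterior odds equal to their prior odds, no matter how much reward data comes in.

```python
import numpy as np

# Hypothetical illustration: two candidate "values" that predict the observable
# reward stream identically, so no amount of data ever separates them.
rng = np.random.default_rng(1)
rewards = rng.normal(1.0, 0.5, size=10_000)  # a long shared reward history

def log_lik(rewards, mean, std=0.5):
    # log P[rewards | hypothesis], for a hypothesis predicting Normal(mean, std) rewards
    return np.sum(-0.5 * np.log(2 * np.pi * std**2) - (rewards - mean) ** 2 / (2 * std**2))

# Both hypotheses imply exactly the same distribution over observable rewards,
# even though they say different things about what is actually valued.
ll_values_include_wireheading = log_lik(rewards, mean=1.0)
ll_values_exclude_wireheading = log_lik(rewards, mean=1.0)
log_bayes_factor = ll_values_include_wireheading - ll_values_exclude_wireheading  # exactly 0

# Two people with different priors over the values-latent see the same data...
for prior_odds in (9.0, 1 / 9.0):
    posterior_odds = prior_odds * np.exp(log_bayes_factor)
    print(posterior_odds)  # ...and keep odds of 9.0 and ~0.11: the priors carry through
```

The leftover disagreement lives entirely in the prior_odds term, which is the part that would have to be physiologically embedded somehow.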
(Another angle: consider Harry Potter. Harry Potter is fictional, but he’s still “a thing”; I can know things about Harry Potter, the things I know about Harry Potter can have predictive power, and I can discover new things about Harry Potter. So what does it mean for Harry to be “fictional”? Well, it means we can only ever “see” Harry through metaphorical TV screens—be it words on a page, or literal screens.
Values are “fictional” in that same sense; reward is the medium in which the fiction is expressed.)