Yes! That analogy is helpful for communicating what you mean!
I still have issues with your thesis though.
I agree that this “explaining away” thing could be a reasonable way to think about, e.g., the situation where I get sick, and while I’m sick, some activity that I usually love (let’s say singing songs) feels meaningless. I probably shouldn’t conclude that “my values” changed, just that the machinery that implements my reward circuitry is being thrown off by my being sick.
On the other hand, I think I could just as well describe this situation as extending the domain over which I’m computing my values. E.g., “I love and value singing songs when I’m healthy, but when I’m sick in a particular way, I don’t love it. Singing-while-healthy is meaningful; not singing per se.”
In the same way, I could choose to call the blue-screen phenomenon an error in the TV, or I could include that dynamic as part of the “predict what will happen with the screen” game. Since there’s no real apple that I’m trying to model, only an ephemeral image of the apple, there’s no principled place to stand on whether to view the blue screen as an error, or just as part of the image-generating process.
For any given fuckery with my reward signals, I could call it an error, a misrepresentation of my “true values,” or I could embrace it as expressing a part of my “true values.” And if two people disagree about which conceptualization to go with, I don’t know how they could possibly resolve it. They’re both valid frames, fully consistent with the data. And they can’t get distinguishing evidence, even in principle.
(I think this is not an academic point; people disagree about values in this way reasonably often.
Is enjoying masturbating to porn an example of your reward system getting hacked by external super-stimuli, or is that just part of the expression of your true values? Both of these are valid ways to extrapolate from the reward-data time series. Which things count as your reward system getting hacked, and which count as representing your values? It seems like a judgement call!
The classic and most fraught example is that some people find it drop-dead obvious that they care about the external world, and not just their sense-impressions of the external world. They’re horrified by the thought of being put in an experience machine, even if their subjective experience would be way better.
Other people just don’t get this. “But your experience would be exactly the same as if the world was awesome. You wouldn’t be able to tell the difference”, they say. It’s obvious to them that they would prefer the experience machine, as long as their memory was wiped so they didn’t know they were in one.[1])
Talking about an epistemic process attempting to update your model of an underlying not-really-real-but-sorta structure seems to miss the degrees of freedom in the game. Since there’s no real apple, no one has any principled place to stand in claiming that “the apple really went half blue right there” vs. “no, the TV signal was just interrupted.” Any question about what the apple is “really doing” is a dangling node.[2]
As a separate point, while I agree the “explaining away disruptions” phenomenon does sometimes happen, I don’t think that’s usually what’s happening when a person reflects on their values. Rather, I’d guess it’s one of the three options that I suggested above.
Admittedly, I think the question of which extrapolation schema to use is itself decided by “your values”, which ultimately grounds out in the reward data. Some people have perhaps a stronger feeling of indignation about others hiding information from them, or perhaps a stronger sense of curiosity, or whatever, that crystallizes into a general desire to know what’s true. Other people have less of that. And so they have different responses to the experience-machine hypothetical.
Because which extrapolation procedure any given person decides to use is itself a function of “their values”, it all grounds out in the reward data eventually. Which perhaps defeats my point here.
> For any given fuckery with my reward signals, I could call it an error, a misrepresentation of my “true values,” or I could embrace it as expressing a part of my “true values.” And if two people disagree about which conceptualization to go with, I don’t know how they could possibly resolve it. They’re both valid frames, fully consistent with the data. And they can’t get distinguishing evidence, even in principle.
I’d classify this as an ordinary epistemic phenomenon. It came up in this thread with Richard just a couple days ago.
Core idea: when plain old Bayesian world models contain latent variables, it is ordinary for those latent variables to have some irreducible uncertainty—i.e. we’d still have some uncertainty over them even after updating on the entire physical state of the world. The latents can still be predictively useful and meaningful, they’re just not fully determinable from data, even in principle.
Standard example (copied from the thread with Richard): the Boltzmann distribution for an ideal gas—not the assorted things people say about the Boltzmann distribution, but the actual math, interpreted as Bayesian probability. The model has one latent variable, the temperature T, and says that all the particle velocities are normally distributed with mean zero and variance proportional to T. Then, just following the ordinary Bayesian math: in order to estimate T from all the particle velocities, I start with some prior P[T], calculate P[T|velocities] using Bayes’ rule, and then for ~any reasonable prior I end up with a posterior distribution over T which is very tightly peaked around the average particle energy… but has nonzero spread. There’s small but nonzero uncertainty in T given all of the particle velocities. And in this simple toy gas model, those particles are the whole world, there’s nothing else to learn about which would further reduce my uncertainty in T.
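That math is easy to sketch numerically. The snippet below assumes units where the proportionality constant is 1 (so each velocity is N(0, T)) and a flat prior over a grid of candidate temperatures—both simplifying choices for illustration, not part of the original argument. It shows the posterior over T concentrating near the mean squared velocity while keeping a strictly nonzero spread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: one latent variable (temperature T); each particle
# velocity is drawn from N(0, T).  Units chosen so the
# proportionality constant is 1 -- an illustrative assumption.
T_true = 2.0
n = 10_000
v = rng.normal(0.0, np.sqrt(T_true), size=n)

# Flat prior over a grid of candidate temperatures.
T_grid = np.linspace(1.5, 2.5, 2001)

# Log-likelihood of all velocities under each candidate T:
# sum_i log N(v_i; 0, T) = -(n/2) log(2*pi*T) - sum(v^2)/(2T)
log_lik = -0.5 * (n * np.log(2 * np.pi * T_grid) + np.sum(v**2) / T_grid)

# Normalize to get the posterior (flat prior, so posterior ∝ likelihood).
post = np.exp(log_lik - log_lik.max())
post /= post.sum()

post_mean = np.sum(T_grid * post)
post_std = np.sqrt(np.sum((T_grid - post_mean) ** 2 * post))

print(post_mean)  # tightly peaked near the mean squared velocity
print(post_std)   # small but strictly nonzero
```

Roughly, the posterior standard deviation here scales like T·sqrt(2/n): it shrinks as the number of particles grows, but never reaches zero for any finite world.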
Bringing it back to wireheading: first, the wireheader and the non-wireheader might just have different rewards; that’s not the conceptually interesting case, but it probably does happen. The interesting case is that the two people might have different value-estimates given basically-similar rewards, and that difference cannot be resolved by data because (like temperature in the above example) the values-latent is underdetermined by the data. In that case, the difference would be in the two people’s priors, which would be physiologically embedded somehow.
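The prior-dependence can be illustrated with the same toy gas model (as an analogy only, not a model of actual reward circuitry): two agents who condition on identical data but start from different priors over the latent end up with persistently different estimates. A minimal sketch, with a deliberately small sample so the effect is visible:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same toy model: velocities ~ N(0, T), with an illustrative unit
# proportionality constant.  With little data, the prior visibly
# shapes the posterior; more data shrinks the gap but (per the
# nonzero-spread point above) never closes it exactly.
T_true = 2.0
n = 50
v = rng.normal(0.0, np.sqrt(T_true), size=n)

T_grid = np.linspace(0.5, 4.0, 3501)
log_lik = -0.5 * (n * np.log(2 * np.pi * T_grid) + np.sum(v**2) / T_grid)

# Agent A: flat prior.  Agent B: prior peaked at T = 1 (arbitrary
# hypothetical choice, standing in for a different "physiologically
# embedded" starting point).
log_prior_a = np.zeros_like(T_grid)
log_prior_b = -0.5 * ((T_grid - 1.0) / 0.3) ** 2

def posterior_mean(log_prior):
    # Posterior ∝ prior * likelihood, normalized over the grid.
    log_post = log_prior + log_lik
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return np.sum(T_grid * post)

mean_a = posterior_mean(log_prior_a)
mean_b = posterior_mean(log_prior_b)
print(mean_a, mean_b)  # same data, different conclusions
```

Both agents are doing correct Bayesian updates on the same evidence; the residual disagreement lives entirely in where they started.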
Tangentially, this is why I expect that the CEV of humans diverges. I think some humans, on maximal reflection, wirehead, and others don’t.