For any given fuckery with my reward signals, I could call it an error, misrepresenting my “true values”, or I could embrace it as expressing a part of my “true values.” And if two people disagree about which conceptualization to go with, I don’t know how they could possibly resolve it. They’re both valid frames, fully consistent with the data. And there’s no distinguishing evidence they could get, even in principle.
I’d classify this as an ordinary epistemic phenomenon. It came up in this thread with Richard just a couple days ago.
Core idea: when plain old Bayesian world models contain latent variables, it is ordinary for those latent variables to have some irreducible uncertainty—i.e. we’d still have some uncertainty over them even after updating on the entire physical state of the world. The latents can still be predictively useful and meaningful, they’re just not fully determinable from data, even in principle.
Standard example (copied from the thread with Richard): the Boltzmann distribution for an ideal gas—not the assorted things people say about the Boltzmann distribution, but the actual math, interpreted as Bayesian probability. The model has one latent variable, the temperature T, and says that all the particle velocities are normally distributed with mean zero and variance proportional to T. Then, just following the ordinary Bayesian math: in order to estimate T from all the particle velocities, I start with some prior P[T], calculate P[T|velocities] using Bayes’ rule, and then for ~any reasonable prior I end up with a posterior distribution over T which is very tightly peaked around the average particle energy… but has nonzero spread. There’s small but nonzero uncertainty in T given all of the particle velocities. And in this simple toy gas model, those particles are the whole world, there’s nothing else to learn about which would further reduce my uncertainty in T.
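To make “nonzero spread” concrete, here’s a minimal numeric sketch of that posterior calculation (mine, not from the thread). Everything specific in it is an illustrative assumption: units chosen so each velocity is N(0, T), a flat prior on a grid of candidate temperatures, and 10,000 particles.

```python
import numpy as np

# Toy version of the ideal-gas example. Units are chosen so that each particle
# velocity is distributed N(0, T), i.e. the variance literally equals T
# (an illustrative assumption; real constants like k_B/m are folded in).

rng = np.random.default_rng(0)
true_T = 2.0
n = 10_000
velocities = rng.normal(0.0, np.sqrt(true_T), size=n)

# Candidate temperatures and a flat prior over the grid ("~any reasonable
# prior" gives essentially the same answer at this sample size).
T_grid = np.linspace(1.5, 2.5, 2001)
dT = T_grid[1] - T_grid[0]
log_prior = np.zeros_like(T_grid)

# Log-likelihood of all the velocities under N(0, T), for each candidate T.
sum_sq = np.sum(velocities ** 2)
log_lik = -0.5 * n * np.log(2 * np.pi * T_grid) - sum_sq / (2 * T_grid)

# Bayes' rule, normalized numerically on the grid.
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum() * dT

mean_T = np.sum(T_grid * post) * dT
std_T = np.sqrt(np.sum((T_grid - mean_T) ** 2 * post) * dT)
print(f"posterior over T: mean ~ {mean_T:.3f}, std ~ {std_T:.3f}")
# The posterior is tightly peaked near the mean squared velocity, but the
# spread is nonzero; and in this toy model the velocities are the whole world,
# so there is nothing left to observe that would shrink it further.
```

Cranking up the particle count shrinks the spread, but never to zero for a finite gas, which is the point.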
Bringing it back to wireheading: first, the wireheader and the non-wireheader might just have different rewards; that’s not the conceptually interesting case, but it probably does happen. The interesting case is that the two people might have different value-estimates given basically-similar rewards, and that difference cannot be resolved by data because (like temperature in the above example) the values-latent is underdetermined by the data. In that case, the difference would be in the two people’s priors, which would be physiologically-embedded somehow.
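To see how a purely-prior difference can persist, here’s a hypothetical sketch of an underdetermined latent (deliberately much simpler than the values case, and not a claim about how values actually work): the data only constrains the sum a + b, so two observers who see all the same data and use the same update rule keep different beliefs about a, with the residual difference coming entirely from their priors.

```python
import numpy as np

# Hypothetical toy of an underdetermined latent (not a model of values): the
# observations only depend on the sum a + b, so "a" stays uncertain no matter
# how much data arrives. Two observers with the same likelihood and the same
# data, but different priors over a, keep persistently different posteriors.

rng = np.random.default_rng(1)
true_a, true_b = 2.0, 3.0
obs = rng.normal(true_a + true_b, 1.0, size=1_000)  # data depends only on a + b

a_grid = np.linspace(0.0, 5.0, 251)
b_grid = np.linspace(0.0, 5.0, 251)
A, B = np.meshgrid(a_grid, b_grid, indexing="ij")

# Log-likelihood of all observations given (a, b): each obs ~ N(a + b, 1).
n, s, ss = len(obs), obs.sum(), np.sum(obs ** 2)
mu = A + B
log_lik = -0.5 * (ss - 2 * s * mu + n * mu ** 2)

def posterior_over_a(log_prior_a):
    # Joint posterior over (a, b) with a flat prior on b, marginalized to a.
    log_post = log_lik + log_prior_a[:, None]
    post = np.exp(log_post - log_post.max())
    post_a = post.sum(axis=1)
    return post_a / post_a.sum()

# Observer 1's prior says a is probably small; observer 2's says probably large.
log_prior_1 = -0.5 * (a_grid - 1.0) ** 2
log_prior_2 = -0.5 * (a_grid - 4.0) ** 2

post_1 = posterior_over_a(log_prior_1)
post_2 = posterior_over_a(log_prior_2)
print("observer 1's posterior mean for a:", float(np.sum(a_grid * post_1)))
print("observer 2's posterior mean for a:", float(np.sum(a_grid * post_2)))
# Both posteriors agree sharply about a + b, but they disagree about a itself,
# and no further observations of this kind could ever close that gap; the
# residual difference is entirely a difference in priors.
```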