Fair warning: the following is pretty sketchy, and I wouldn’t bet I’d stick with it if I thought about it a bit longer.
---
Imagine a simple computer running a simple chess-playing program. The program uses purely integer computation, except that it calculates its reward function and runs minimax over it in floating point. The search looks for the move that maximizes the outcome, which corresponds to a win.
This, if I understand your parlance, is ‘rational’ behaviour.
Now consider that the reward is negated, and the planner instead looks for the move that minimizes the outcome.
This, if I understand your parlance, is ‘anti-rational’ behaviour.
Now consider that this anti-rational program is run on a machine where floating-point values encoded with a sign bit of ‘1’ represent a positive number and those with a sign bit of ‘0’ a negative number, the opposite of the standard encoding.
It’s the same ‘anti-rational’ program, but exactly the same wires are lit up in the same pattern on this hardware as with the ‘rational’ program on the original hardware.
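To make that concrete, here is a minimal sketch of the two programs (my own toy illustration, not anything from the paper; the game tree, reward values, and function names are all invented). It checks that a minimax planner maximizing R and one minimizing the negated reward pick the same move, and that negating an IEEE-754 double flips exactly its sign bit:

```python
import struct

# Hypothetical two-ply game: each of our moves allows two opponent replies,
# each ending in a leaf whose value is the reward for us.
GAME_TREE = {
    "move_a": [0.3, 0.9],
    "move_b": [0.6, 0.7],
    "move_c": [0.1, 1.0],
}

def rational_planner(reward):
    """Maximize the worst-case reward (standard minimax)."""
    return max(GAME_TREE, key=lambda m: min(reward(v) for v in GAME_TREE[m]))

def anti_rational_planner(reward):
    """Minimize the best-case reward (minimax with both roles flipped)."""
    return min(GAME_TREE, key=lambda m: max(reward(v) for v in GAME_TREE[m]))

R = lambda v: v        # original reward
neg_R = lambda v: -v   # negated reward

# (p, R) and (-p, -R) choose exactly the same move:
assert rational_planner(R) == anti_rational_planner(neg_R) == "move_b"

def differs_only_in_sign_bit(x):
    """Negating an IEEE-754 double flips exactly the top (sign) bit."""
    bits = lambda f: struct.unpack(">Q", struct.pack(">d", f))[0]
    return bits(x) ^ bits(-x) == 1 << 63

assert differs_only_in_sign_bit(0.7)
```

Under the inverted sign-bit convention, the anti-rational program’s stored reward bits coincide with the rational program’s, which is the wire-level identity the thought experiment points at.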
In what sense can you say the difference between rationality and anti-rationality exists in the program (or in humans) at all, rather than in the model of them, when the same wires are both rational and anti-rational? I believe the same dilemma holds for indifferent planners. It doesn’t seem like reward functions of the kind your paper talks about are a real thing, at least in any sense independent of interpretation, so it makes sense that you struggle to distinguish them when they aren’t there to distinguish.
---
I am tempted to base an argument on the claim that misery is avoided because it’s bad, rather than being bad because it’s avoided. If true, this short-circuits a lot of your concern: reward functions exist only in the map, where numbers and abstract symbols can be flipped arbitrarily, but in the physical world these good and bad states have an intrinsic quality to them and can be distinguished meaningfully. Thus the question is not how to distinguish indistinguishable reward functions, but how to understand this aspect of qualitative experience. Then, presumably, if a computer could understand what the experience of unhappiness is like, it would not have to assume our preferences.
This doesn’t help solve the mystery. Why couldn’t a species evolve to maximise its negative internal emotional states? We can’t reasonably have gotten preference and optimization lined up by pure coincidence, so there must be a reason. But it seems like a more reasonable stance to shove the question off into the ineffable mysteries of qualia than to conflate it with a formalism that seems necessarily independent of the thing we’re trying to measure.
It’s because of concerns like this that we have to solve the symbol grounding problem for the human we are trying to model; see, e.g., https://www.lesswrong.com/posts/EEPdbtvW8ei9Yi2e8/bridging-syntax-and-semantics-empirically
But that doesn’t detract from the main point: that simplicity, on its own, is not sufficient to resolve the issue.
It kind of does. You have shown that simplicity cannot distinguish (p, R) from (-p, -R), but you have not shown that simplicity cannot distinguish a physical person optimizing competently for a good outcome from a physical person optimizing nega-competently for a bad outcome.
If it seems unreasonable for there to be a difference, consider a similar map-territory distinction: a height map versus the mountain itself. An optimization routine that does gradient descent on a height map is about the same complexity as one that does gradient ascent on the height map’s inverse. However, a system that physically does gradient descent on the actual mountain can be much simpler than one that does gradient ascent on the mountain’s inverse. Since negative mental experiences are somehow qualitatively different to positive ones, it would not surprise me much if they did in fact effect a similar asymmetry here.
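As a map-side illustration of that first claim (a sketch of my own; the quadratic height map, learning rate, and function names are all made up), gradient descent on h and gradient ascent on -h differ by a single flipped sign and land in the same place:

```python
import math

# Hypothetical one-dimensional "height map": h(x) = (x - 2)^2, so dh/dx = 2(x - 2).
h_grad = lambda x: 2.0 * (x - 2.0)
inverse_grad = lambda x: -h_grad(x)  # gradient of the inverted map, -h

def descend(grad, x0=0.0, lr=0.1, steps=200):
    """Gradient descent: step against the gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def ascend(grad, x0=0.0, lr=0.1, steps=200):
    """Gradient ascent: identical code apart from one flipped sign."""
    x = x0
    for _ in range(steps):
        x += lr * grad(x)
    return x

# On the map, the two procedures are (near-)equally complex and converge to
# the same point (x = 2 here).
assert math.isclose(descend(h_grad), ascend(inverse_grad), abs_tol=1e-9)
```

The territory half of the analogy, where the claimed asymmetry lives, is exactly the part a map-side sketch like this cannot capture.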
Saying that an agent has a preference/reward R is an interpretation of that agent (similar to the “intentional stance” of seeing it as an agent, rather than a collection of atoms). And the (p,R) and (-p,-R) interpretations are (almost) equally complex.
One of us is missing what the other is saying. I’m honestly not sure what argument you are putting forth here.
I agree that preference/reward is an interpretation (the terms I used were map and territory). I agree that (p,R) and (-p,-R) are approximately equally complex. I do not agree that complexity is necessarily isomorphic between the map and the territory. This means that although the model might be a strong analogy when talking about behaviour, it is sketchy to use it as a model for the complexity of behaviour.
I tried to answer in more detail here: https://www.lesswrong.com/posts/f5p7AiDkpkqCyBnBL/preferences-as-an-instinctive-stance (hope you didn’t mind; I used your comment as a starting point for a major point I wanted to clarify).
But I admit to being confused now, and not understanding what you mean. Preferences don’t exist in the territory, so I’m not following you, sorry! :-(
Obviously misery would be avoided because it’s bad, not the other way around. We are trying to figure out what is bad by seeing what we avoid. And the problem remains whether we might be accidentally avoiding misery, while trying to avoid its opposite.
> Obviously misery would be avoided because it’s bad, not the other way around.

As mentioned, this isn’t obvious to me, so I’d be interested in your reasoning. Why should evolution build systems that want to avoid intrinsically bad mental states?

> We are trying to figure out what is bad by seeing what we avoid. And the problem remains whether we might be accidentally avoiding misery, while trying to avoid its opposite.

Yes, my point here was twofold. One, the formalism used in the paper does not seem to be deeply meaningful, so it would be best to look for some other angle of attack. Two, given the claim about intrinsic badness, the programmer is embedding domain knowledge (about conscious states), not unlearnable assumptions. A computer system would fail to learn this because qualia are a hard problem, not because the knowledge is unlearnable. That makes the obstacle asymmetric and circumventable in a way that the no-free-lunch theorem is not.