“Agent A has preferences R” is not a fact about the world. It is a stance about A, or an interpretation of A. A stance or an interpretation that we choose to take, for some purpose or reason.
I find it hard to imagine that you’re actually denying that you or I have things that, colloquially, one would describe as preferences, and exist in an objective sense. I do have a preference for a happy and meaningful life over a life of pure agony. Anyone who thinks I do not is factually wrong about the state of the world.
Then there is a sense in which the models we build of these systems are purely interpretative. If “preferences R” refers to a function returning a real number, then for sure this is not some facet of the real world, and there are many such seemingly-different models for any agent. Here again I believe we agree.
But we seem not to be agreeing at the next step, the preference stance. Here I claim your goal should not be to maximise the function “preferences R”, whose precise values are irrelevant and arbitrary, but to maximise the actual human preferences.
Consider measuring a simpler system, temperature, and projecting it onto some number. Clearly, depending on how you do this projection, you can end up with any number for a given temperature. Even with a simplicity prior, higher temperatures can correspond to larger numbers or to smaller numbers in the projection, with pretty much equal plausibility. So even in this simplified situation, where we can agree that some temperatures are objectively higher than others, you cannot reliably maximise temperature by maximising its projection.
Your preference function is a projection. The arbitrary choices you have to make to build this function are not assumptions about the world; they are choices about the model. When you prove that there are many models of human preference, you are not proving that preference is entirely subjective.
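To make the projection point concrete, here is a minimal toy sketch (my own illustration, with made-up values, not anything from your comments): two equally simple encodings of temperature, where maximising the encoded number tracks temperature under one of them and anti-tracks it under the other.

```python
# Toy illustration (assumed values): two equally simple projections of
# temperature onto a number.
temperatures = [10.0, 20.0, 30.0]   # objectively ordered: 30.0 is hottest

def project_up(t):
    return t     # hotter -> larger number

def project_down(t):
    return -t    # hotter -> smaller number

# Both projections preserve the full ordering information, yet maximising
# the projected value picks out opposite temperatures.
assert max(temperatures, key=project_up) == 30.0
assert max(temperatures, key=project_down) == 10.0
```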
That’s why, when you use empathy to figure out someone’s goals and rationality, this also allows you to better predict them. But this is a fact about you (and me), not about the world. Just as “Thor is angry” is actually much more complex than electromagnetism, our prediction of other people via our empathy machine is simpler for us to do—but is actually more complex for an agent that doesn’t already have this empathy machinery to draw on.
This Thor analogy is… illuminating as to the differences in our perspectives. Imagining an angry Thor is a much more complex hypothesis, right up until the point you see an actual Thor in the sky hurling spears of lightning. Then it becomes the only reasonable conclusion, because although a brain seems like it involves a lot of assumptions, it ultimately involves many fewer assumptions (to the pre-industrial Norse people) than that same amount of coincidence.
This is the point I am making with people. If your computer models people as arbitrary, randomly sampled programs, of course you will struggle to distinguish human behaviour from its contrapositive. However, people are neither fully independent nor arbitrary computing systems. Arguing that a physical person optimising competently for a good outcome and a physical person optimising nega-competently for a bad outcome are similarly simple has to overcome at least two hurdles (a toy version of the decomposition claim itself is sketched just after this passage):
1. We seem to know things about which mental states are good and which mental states are bad. This implies there is objective knowledge that can be learnt about it.
2. You would need to extend your arguments about mathematical functions into the real world. I don’t know how this could be approached.
I have a hard time believing that in another world people would think the qualia corresponding to our suffering are good and the qualia corresponding to our happiness are bad; and if that is so, it strikes me as a much bigger deal than anything else you are saying.
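For anyone following along, here is a minimal toy sketch (my own construction, with invented names and values, not taken from either of our write-ups) of the decomposition claim I am pushing back against: pair a competent planner with a reward R, or a nega-competent planner with the negated reward, and the observable behaviour is identical.

```python
# Toy version (assumed names and values) of the decomposition being debated:
# (competent planner, R) and (nega-competent planner, -R) are behaviourally
# indistinguishable, even though one "wants" the good outcome and the other
# the bad one.

def competent(reward, actions):
    # Chooses the action the reward function rates highest.
    return max(actions, key=reward)

def nega_competent(reward, actions):
    # Chooses the action the reward function rates lowest.
    return min(actions, key=reward)

actions = ["help", "ignore", "harm"]
reward = {"help": 1.0, "ignore": 0.0, "harm": -1.0}.get
anti_reward = lambda a: -reward(a)

# Identical observable behaviour from opposite (planner, reward) pairs.
assert competent(reward, actions) == nega_competent(anti_reward, actions) == "help"
```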
I find it hard to imagine that you’re actually denying that you or I have things that, colloquially, one would describe as preferences, and exist in an objective sense.
I deny that a generic outside observer would describe us as having any specific set of preferences, in an objective sense.
This doesn’t bother me too much, because it’s sufficient that we have preferences in a subjective sense—that we can use our own empathy modules and self-reflection to define, to some extent, our preferences.
a brain is ultimately many fewer assumptions (to the pre-industrial Norse people)
“Realistic” preferences make ultimately fewer assumptions (to actual humans) than “fully rational” or other preference sets.
The problem is that this is not true for generic agents, or AIs. We have to get the human empathy module into the AI first—not so it can predict us (it can already do that through other means), but so that its decomposition of our preferences is the same as ours.
I deny that a generic outside observer would describe us as having any specific set of preferences, in an objective sense.
It’s possible that we’ve been struggling with this conversation because I’ve been failing to grasp just how radically different your opinions are to mine.
Imagine your generic outside observer was superintelligent, and understood (through pure analysis) qualia and all the corresponding mysteries of the mind. Would you then still say this outside observer would not consider us to have any specific set of preferences, in an objective sense, where “preferences” takes on its colloquial meaning?
If not, why? I think my stance is obvious; where preferences colloquially means approximately “a greater liking for one alternative over another or others”, all I have to claim is that there is an objective sense in which I like things, which is simple because there’s an objective sense in which I have that emotional state and internal stance.