Good point, but the fact that humans are consequentialists (at least partly) doesn’t seem to make the problem much easier. Suppose we replace Yvain’s blue-minimizer robot with a simple consequentialist robot that has the same behavior (let’s say it models the world as a 2D grid of cells that have intrinsic color, it always predicts that any blue cell that it shoots at will turn some other color, and its utility function assigns negative utility to the existence of blue cells). What does this robot “actually want”, given that the world is not really a 2D grid of cells that have intrinsic color?
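For concreteness, here is a minimal sketch of the kind of robot being described, assuming a greedy one-step chooser; the names, the tiny grid, and the prediction rule are invented for illustration, not taken from Yvain's post:

```python
# Minimal sketch of the simple consequentialist blue-minimizer described above.
# Everything here (names, grid size, prediction rule) is a hypothetical toy.

BLUE, OTHER = "blue", "other"

def utility(grid):
    """Negative utility for every blue cell in the robot's internal world-model."""
    return -sum(cell == BLUE for row in grid for cell in row)

def predict(grid, action):
    """The robot's fixed belief: shooting a blue cell turns it some other color."""
    if action is None:  # "do nothing"
        return grid
    r, c = action
    new_grid = [list(row) for row in grid]
    if new_grid[r][c] == BLUE:
        new_grid[r][c] = OTHER
    return new_grid

def choose_action(grid):
    """Pick the action whose predicted outcome maximizes utility."""
    actions = [None] + [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    return max(actions, key=lambda a: utility(predict(grid, a)))

world_model = [[BLUE, OTHER],
               [OTHER, BLUE]]
print(choose_action(world_model))  # shoots a blue cell, e.g. (0, 0)
```

The "what does it actually want" question is then the question of what, if anything, `utility` and `predict` pick out once the 2D-grid ontology they are written in turns out not to describe the real world.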
Who cares about the question of what the robot “actually wants”? Certainly not the robot. Humans care about the question of what they “actually want”, but that’s because they have additional structure that this robot lacks. But with humans, you’re not limited to just looking at what they do on auto-pilot; instead, you can ask that additional structure when you run into problems like this. For example, if you asked me what I really wanted under some weird ontology change, I could say, “I have some guesses, but I don’t really know; I would like to defer to a smarter version of me”. That’s how I understand preference extrapolation: not as something that looks at what your behavior suggests you’re trying to do and then does it better, but as something that poses the question of what you want to some system you’d like to answer the question for you.
It looks to me like there’s a mistaken tendency among many people here, including some very smart people, to say that I’d be irrational to let my stated preferences deviate from my revealed preferences; that just because I seem to be trying to do something (in some sense like: when my behavior isn’t being controlled much by the output of moral philosophy, I can be modeled as a relatively good fit to a robot with some particular utility function), that’s a reason for me to do it even if I decide that I don’t want to. But rational utility maximizers get to be indifferent to whatever the heck they want, including their own preferences, so it’s hard for me to see why the underdeterminedness of the true preferences of robots like this should bother me at all.
Insert standard low confidence about me posting claims on complicated topics that others seem to disagree with.
In other words, our “actual values” come from our being philosophers, not our being consequentialists.
It seems plausible to me, and I’m not sure that “many” others do disagree with you.
That would imply a great diversity of value systems, because philosophical intuitions differ much more from person to person than primitive desires. Some of these value systems (maybe including yours) would be simple, some wouldn’t. For example, my “philosophical” values seem to give large weight to my “primitive” values.
That might be a procedure that generates human preference, but it is not a general preference extrapolation procedure. E.g., suppose we replace Wei Dai’s simple consequentialist robot with a robot that has similar behavior, but that also responds to the question, “What system do you want to answer the question of what you want for you?” with the answer, “A version of myself better able to answer that question. Maybe it should be smarter and know more things and be nicer to strangers and not have scope insensitivity and be less prone to skipping over invisible moral frameworks and have concepts that are better defined over attribute space and be automatically strategic and super committed and stuff like that? But since I’m not that smart and I pass over moral frameworks and stuff, everything I just said is probably insufficient to specify the right thing. Maybe you can look at my source code and figure out what I mean by right and then do the thing that a person who better understood that would do?” And then goes right back to zapping blue.
Suppose we replace Wei Dai’s simple consequentialist robot with a robot that has similar behavior, but that also responds to the question, “What system do you want to answer the question of what you want for you?” with the answer, “I want to decide for myself” and responds to the question, “What do you want to do?” with the answer, “I want to make babies happy. Oh, and help grandmother out of the burning building. Oh, and without killing her. Oh, and to preserve complex novelty. Oh, and boredom. Oh, and there should still be people in the world who are trying to improve it. Oh, and... dammit, this is complicated. Okay, never mind, I want you to ask the version of myself who I presently think is smart enough to answer this question and who knows what the right thing to do is even better than me.”
It can answer those two questions, but if you ask it to clarify the last response, it just blows up.
Actually, this notion of consequentialism gives a new clue, and the only one I know of, about how to infer an agent’s goals, or how to constrain the kinds of considerations that should count as goals, as opposed to the other stuff that moves your actions incidentally, such as psychological drives or the laws of physics. I wonder if Eliezer had this insight before, given that he wrote a similar comment in this thread. I wasn’t ready to see this idea on my own until a few weeks ago, and this thread is the first time I thought about the question given the new framework, and saw the now-obvious construction. This deserves more than a comment, so I’ll be working on a two-post sequence to write this up intelligibly. Or maybe it’s actually just stupid; I’ll try to figure that out.
(A summary from my notes, in case I get run over by a bus; this uses a notion of “dependence” for which a toy example is described in my post on ADT, but which is much more general.)
The idea of consequentialism, of goal-directed control, can be modeled as follows. If a fact A is controlled by (can be explained/predicted based on) a dependence F: A->O, then we say that A is a decision (action) driven by a consequentialist consideration F, which in turn looks at how A controls the morally relevant fact O.
For a given decision A, there could be many different morally relevant facts O such that the dependence A->O has explanatory power about A. The more a dependence A->O can explain about A, the more morally relevant O is. Finding highly relevant facts O essentially captures A’s goals.
This model has two good properties. First, logical omniscience (in particular, mere knowledge of the actual action) renders the construction unusable, since we need to see dependences A->O as ambient concepts explaining A, so both A and A->O need to remain potentially unknown. (This is the confusing part. It also lends motivation to the study of the complete collection of moral arguments and the nature of the agent-provable collection of moral arguments.)
Second, the action (decision) itself, and many other facts that control the action but aren’t morally relevant, are distinguished by this model from the things that are. For example, A can’t be morally relevant, for that would require the trivial identity dependence A->A to explain A, which it can’t, since it’s too simple. Similarly for other stuff in a simple relationship with A: the relationship between A and a fact must be in tune with A for the fact to be morally relevant; it’s not enough for the fact itself to be in tune with A.
This model doesn’t require a fixed definition of a goal concept; instead, it shows how various concepts can be regarded as goals, and how their suitability for this purpose can be compared. The search for better morally relevant facts is left open-ended.
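Setting aside the logical-uncertainty aspect that the summary says is essential (and that a toy like this cannot capture), here is one crude way to picture the “explanatory power” idea; the candidate facts, actions, and observations are all invented:

```python
# Crude toy of the idea above (hypothetical names throughout): treat each candidate
# "morally relevant fact" O as a mapping from actions to outcomes (a dependence A->O),
# and score it by how often "the action optimizes O through that dependence"
# actually explains the agent's observed behavior.

ACTIONS = ["zap_blue", "zap_red", "do_nothing"]

# Observed behavior of the agent over a few episodes.
observed = ["zap_blue", "zap_blue", "zap_blue"]

# Candidate dependences: for each fact O, how much "badness" remains after each action.
candidates = {
    "blue_cells_remaining": lambda a: 0 if a == "zap_blue" else 1,
    "red_cells_remaining":  lambda a: 0 if a == "zap_red" else 1,
    "shots_fired":          lambda a: 1 if a != "do_nothing" else 0,
}

def explains(dependence, action):
    """Does 'minimize the fact picked out by this dependence' predict this action?"""
    return min(ACTIONS, key=dependence) == action

scores = {name: sum(explains(dep, a) for a in observed)
          for name, dep in candidates.items()}
print(scores)  # "blue_cells_remaining" explains all three observed actions
```

Here "blue_cells_remaining" comes out as the most "morally relevant" candidate simply because the consequentialist explanation through it accounts for the most observed behavior; the real proposal is about ambient logical dependences, not explicit action-to-outcome tables like these.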
I very much look forward to your short sequence on this. I hope you will also explain your notion of dependence in detail.
For the record, I mostly completed a draft of a prerequisite post (the first of the two I had in mind) a couple of weeks ago, and it’s just no good: not much better than what one would take away from reading the previously published posts, and not particularly helpful in clarifying the intuition expressed in the above comments. So I’m focusing on improving my math skills, which I expect will help with the formalization/communication problem (given a few months), as well as with moving forward. I might post some version of the post, but it seems it won’t be able to serve the previously intended purpose.
Bummer.
As for communication, it would help me (at least) if you used words in their normal senses unless they are a standard LW term of art (e.g. ‘rationalist’ means LW rationalist, not Cartesian rationalist) or unless you specify that you’re using the term in an uncommon sense.
Don’t see how this is related to this thread, and correspondingly what kinds of word misuse you have in mind.
It isn’t related to this thread. I was thinking of past confusions between us over ‘metaethics’ and ‘motivation’ and ‘meaning’, where I didn’t realize until pretty far into the discussion that you were using these terms to mean something different from what they normally mean. I’d generally like to avoid that kind of thing; that’s all I meant.
Well, I’m mostly not interested in the concepts corresponding to how these words are “normally” used. The miscommunication problems resulted both from my misinterpreting your usage and from your misinterpreting mine. I won’t be misinterpreting your usage in similar cases in the future, as I now know better what you mean by which words, and in my own usage, as we discussed a couple of times, I’ll be clearer by using disambiguating qualifiers, which in most cases amounts to writing the word “normative” more frequently.
(Still unclear/strange why you brought it up in this particular context, but no matter...)
Yup, that sounds good.
I brought it up because you mentioned communication, and your comment showed up in my LW inbox today.
The robot wants to minimize the amount of blue it sees in its grid representation of the world. It can do this by affecting the world with a laser. But it could also change its camera system so that it sees less blue. If there is no term in the utility function that says that the grid has to model reality, then both approaches are equally valid.
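A toy version of that point, assuming a utility computed only over the robot's internal grid and a hypothetical perception step:

```python
# Hypothetical toy: if utility is computed only over the robot's internal grid,
# lasering the blue object and lasering its own camera score identically,
# since both leave the internal grid with zero blue cells.

BLUE, OTHER, DEAD = "blue", "other", "dead"

def internal_utility(grid):
    return -sum(cell == BLUE for row in grid for cell in row)

def perceive(world, camera_intact):
    """What the camera reports; a lasered camera never registers blue again."""
    if camera_intact:
        return [row[:] for row in world]
    return [[DEAD for _ in row] for row in world]

world = [[BLUE, OTHER]]

# Strategy 1: laser the blue object, keep the camera intact.
grid_after_lasering_world = perceive([[OTHER, OTHER]], camera_intact=True)

# Strategy 2: leave the world alone, laser the camera.
grid_after_lasering_camera = perceive(world, camera_intact=False)

print(internal_utility(grid_after_lasering_world),
      internal_utility(grid_after_lasering_camera))  # 0 0
```

Both strategies leave the internal grid with zero blue cells, so a utility function defined only over the model cannot distinguish fixing the world from blinding the camera.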
steven0461's comment notwithstanding, I can take a guess at what the robot actually wants. I think it wants to take the action that will minimize the number of blue cells existing in the world, according to the robot’s current model of the world. That rule for choosing actions probably doesn’t correspond to any coherent utility function over the real world, but that’s not really a surprise.
The interesting question that you probably meant to ask is whether the robot’s utility function over its model of the world can be converted to a utility function over the real world. But the robot won’t agree to any such upgrade, so the question is kinda moot.
That might sound hopeless for CEV, but fortunately humans aren’t consequentialists with a fixed model of the world. Instead they seem to be motivated by pleasure and pain, which you can’t disprove out of existence by coming up with a better model. So maybe there’s hope in that direction.
To avoid SEEING blue things. If the model is good enough for it, it’d search out a mirror and laser its own camera so that it could NEVER see a blue pixel again.
This can be modelled using human empathy by equating the sensation of seeing blue with pain. You don’t care to minimize damage to your body (if it’s not something that actually cripples you), but you care about not getting the signal that it’s happening, and your reaction to a pill that turned you masochist would be very different from your reaction to a murder pill.
Edit: huh? I am surprised that this is downvoted, and the most probable reason is that I’m wrong in some obvious way that I can’t see; can someone please tell me how? (Or maybe my usage of empathy was interpreted way too literally.)
(Reading this comment first might be helpful.)
To answer your thought experiment: it doesn’t matter what the agent thinks it’s acting based on; we look at it from the outside instead (but using a particular definition/dependence that specifies the agent), and ask how its action depends on the dependence of the actual future on its actual action. The agent’s misconceptions don’t enter this question. If the misconceptions are great, it’ll turn out that the dependence of the actual future on the agent’s action doesn’t control its action, or controls it in some unexpected way. Alternatively, we could say that it’s not the actual future that is morally relevant for the agent, but some other strange fact, in which case the agent could be said to be optimizing a world that is not ours. From yet another perspective, the role of the action could be played by something else, but then it’s not clear why we are considering such a model and talking about this particular actual agent at the same time.
Is that something you can see from the outside? If I argmax over actions in expected-paper-clips or over updateless-prior-expected-paper-clips, how can you translate my black box behavior over possible worlds into the dependence of my behavior on the dependence of the worlds on my behavior?
See the section “Utility functions” of this post: it shows how a dependence between two fixed facts could be restored in an ideal case where we can learn everything there is to learn about it. Similarly, you could consider the fact of which dependence holds between two facts, with various specific functions as its possible values, and ask what you can infer about the other fact if you assume that the dependence is given by a certain function.
More generally, a dependence follows possible inferences: things that could be inferred about one fact if you learn new things about the other fact. It needs to follow all such inferences, to the best of the agent’s ability; otherwise it won’t be right and you’ll get incorrect decisions (counterfactual models).
Edit: Actually, never mind, I missed your point. Will reply again later. (Done.)
Our world is not really a 2D grid, but its world could be. It won’t be a consequentialist about our world then, for that requires the dependence of its decisions on the dependence of our world on its decisions. It looks like the robot you described wants to minimize blueness in the 2D grid, and would do that sensibly in the context of the 2D grid world, or any world that can influence the 2D grid world. For the robot, our world doesn’t exist in the same sense that the 2D grid world doesn’t exist for us, even though we could build an instance of the robot in our world. If we do build such an instance, the robot, if extremely rational and not just acting on heuristics adapted for its natural habitat, could devote its our-worldly existence to finding ways of acausally controlling its 2D world. For example, if there are some 2D-worlders out there simulating our world, it could signal to them something that is expected to reduce blueness.
(This all depends on the details of the robot’s decision-making tools, of course. It could really be talking about our world, but then its values collapse, and it could turn out not to be a consequentialist after all, or to be optimizing something very different, to the extent that the conflict in the definitions is strong.)