I’m suggesting that there should be a mathematical operator which takes a “digitized” representation of an agent, either in white-box form (e.g. uploaded human brain) or in black-box form (e.g. chatroom logs) and produces a utility function. There is nothing human-specific in the definition of the operator: it can as well be applied to e.g. another AI, an animal or an alien. It is the input we provide the operator that selects a human utility function.
I don’t understand how such an operator could work.
Suppose I give you a big messy data file that specifies neuron state and connectedness. And then I give you a big complicated finite-element simulator that can accurately predict what a brain would do, given some sensory input. How do you turn that into a utility function?
I understand what it means to use utility as a model of human preference. I don’t understand what it means to say that a given person has a specific utility function. Can you explain exactly what the relationship is between a brain and this abstract utility function?
I don’t see how that addresses the problem. You’re linking to a philosophical answer, and this is an engineering problem.
The claim you made, some posts ago, was “we can set an AI’s goals by reference to a human’s utility function.” Many folks objected that humans don’t really have utility functions. My objection was “we have no idea how to extract a utility function, even given complete data about a human’s brain.” Defining “utility function” isn’t a solution. If you want to use “the utility function of a particular human” in building an AI, you need not only a definition, but a construction. To be convincing in this conversation, you would need to at least give some evidence that such a construction is possible.
You are trying to use, as a subcomponent, something we have no idea how to build and that seems possibly as hard as the original problem. And this isn’t a good way to do engineering.
The way I expect AGI to work is receiving a mathematical definition of its utility function as input. So there is no need to have a “construction”. I don’t even know what a “construction” is, in this context.
Note that in my formal definition of intelligence, we can use any appropriate formula in the given formal language as a utility function, since it all comes down to computing logical expectation values. In fact I expect a real seed AGI to work through computing logical expectation values (by an approximate method, probably some kind of Monte Carlo).
Of course, if the AGI design we will come up with is only defined for a certain category of utility functions then we need to somehow project into this category (assuming the category is rich enough for the projection not to lose too much information). The construction of this projection operator indeed might be very difficult.
In practice, I formulated the definition with utility = Solomonoff expectation value of something computable. But this restriction isn’t necessary. Note that my proposal for defining logical probabilities admits self reference in the sense that the reasoning system is allowed to speak of the probabilities it assigns (like in Christiano et al).
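For a flavour of what “computing logical expectation values by an approximate (Monte Carlo) method” could mean in practice, here is a minimal toy sketch (purely illustrative, not the actual formalism): it estimates an expected utility under a crude simplicity prior over bit-string “hypotheses”, with an arbitrary placeholder utility standing in for the real one.

```python
import random

def sample_hypothesis(max_len=20):
    """Sample a bit string; longer strings are exponentially less likely.

    A crude stand-in for a Solomonoff-style simplicity prior over programs
    (toy assumption, not the actual prior used in the formalism)."""
    length = 1
    while length < max_len and random.random() < 0.5:
        length += 1
    return tuple(random.randint(0, 1) for _ in range(length))

def toy_utility(hypothesis):
    """Arbitrary placeholder utility over 'universes' (here: bit strings)."""
    return sum(hypothesis) / len(hypothesis)

def expected_utility(n_samples=100_000):
    """Estimate the expectation value of the utility by plain Monte Carlo."""
    return sum(toy_utility(sample_hypothesis()) for _ in range(n_samples)) / n_samples

print(expected_utility())  # about 0.5 for this particular toy utility
```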
Humans don’t follow anything like a utility function, which is a first problem, so you’re asking the AI to construct something that isn’t there. Then you have to knit this together into a humanity utility function, which is very non trivial (this is one feeble and problematic way of doing this: http://lesswrong.com/r/discussion/lw/8qb/cevinspired_models/).
The other problem is that you haven’t actually solved many of the hard problems. Suppose the AI decides to kill everyone, then replay, in an endless loop, the one upload it has, having a marvellous experience. Why would it not do that? We want the AI to correctly balance our higher order preferences (not being reduced to a single mindless experience) with our lower order preferences (being happy). But that desire is itself a higher order preference—it won’t happen unless the AI already decides that higher order preferences trump lower ones.
And that was one example I just thought of. It’s not hard to come up with scenarios where the AI does something in this model (e.g. replaces everyone with chatterbots that describe their ever-increasing happiness and fulfilment) that is compatible with the original model but clearly stupid—clearly stupid to our own judgement, though, not to the AI’s.
You may object that these problems won’t happen—but you can’t be confident of this, as you haven’t defined your solution formally, and are relying on common sense to reject those pathological solutions. But nowhere have you assumed the AI has common sense, or how it will use it. The more details you put in your model, I think, the more the problems will become apparent.
Deducing the correct utility of a utility maximiser is one thing (which has a low level of uncertainty, higher if the agent is hiding stuff).
In the white-box approach it can’t really hide. But I guess it’s rather tangential to the discussion.
Assigning a utility to an agent that doesn’t have one is quite another… Humans don’t follow anything like a utility function, which is a first problem, so you’re asking the AI to construct something that isn’t there.
What do you mean by “follow a utility function”? Why do you think humans don’t do it? If it isn’t there, what does it mean to have a correct solution to the FAI problem?
See http://lesswrong.com/lw/6ha/the_blueminimizing_robot/ Key quote: “The robot is a behavior-executor, not a utility-maximizer.”
The main problem with Yvain’s thesis is in the paragraph:
Again, give the robot human level intelligence. Teach it exactly what a hologram projector is and how it works. Now what happens? Exactly the same thing—the robot executes its code, which says to scan the room until its camera registers blue, then shoot its laser.
What does Yvain mean by “give the robot human level intelligence”? If the robot’s code remained the same, in what sense does it have human level intelligence?
Then you have to knit this together into a humanity utility function, which is very non trivial.
This is the part of the CEV proposal which always seemed redundant to me. Why should we do it? If you’re designing the AI, why wouldn’t you use your own utility function? At worst, an average utility function of the group of AI designers? Why do we want / need the whole humanity there? Btw, I would obviously prefer my utility function in the AI but I’m perfectly willing to settle on e.g. Yudkowsky’s.
Suppose the AI decides to kill everyone, then replay, in an endless loop, the one upload it has, having a marvellous experience… the AI does something stupid in this model (eg: replaces everyone with chatterbots that describe their ever increasing happiness and fulfilment)...
It seems that you’re identifying my proposal with something like “maximize pleasure”. The latter is a notoriously bad idea, as was discussed endlessly. However, my proposal is completely different. The AI wouldn’t do something the upload wouldn’t do because such an action is opposed to the upload’s utility function.
You may object that these problems won’t happen—but you can’t be confident of this, as you haven’t defined your solution formally...
Actually, I’m not far from it (at least I don’t think I’m further than CEV). Note that I have already defined formally I(A, U), where I=intelligence, A=agent, U=utility function. Now we can do something like “U(A) is defined to be the U s.t. the probability that I(A, U) > I(R, U) for a random agent R is maximal”. Maybe it’s more correct to use something like a thermal ensemble with I(A, U) playing the role of energy: I don’t know, I don’t claim to have solved it all already. I just think it’s a good research direction.
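For concreteness, here is a minimal toy sketch of that selection rule under very strong simplifying assumptions of my own (a hand-written finite menu of candidate utility functions, agents modelled as policies over three actions, and I(A, U) crudely replaced by the utility the policy actually achieves); it only illustrates the shape of the proposal, not the real definition of I(A, U).

```python
import random

# Toy world: each action leads deterministically to an outcome (assumption).
OUTCOMES = {"a": "paperclips", "b": "staples", "c": "thumbtacks"}
ACTIONS = list(OUTCOMES)

# A finite, hand-written menu of candidate utility functions (assumption:
# the actual proposal quantifies over formulas, not over a small dict).
CANDIDATE_UTILITIES = {
    "likes_paperclips": {"paperclips": 1.0, "staples": 0.0, "thumbtacks": 0.0},
    "likes_staples":    {"paperclips": 0.0, "staples": 1.0, "thumbtacks": 0.0},
    "indifferent":      {"paperclips": 0.5, "staples": 0.5, "thumbtacks": 0.5},
}

def achieved_utility(policy, utility):
    """Crude stand-in for I(A, U): the utility the policy's action obtains."""
    return utility[OUTCOMES[policy()]]

def random_agent():
    """A 'random agent' R: picks an action uniformly at random."""
    return random.choice(ACTIONS)

def infer_utility(policy, n_samples=10_000):
    """U(A) := the candidate U maximizing P[I(A, U) > I(R, U)] over random R."""
    def beat_rate(utility):
        wins = sum(
            achieved_utility(policy, utility) > achieved_utility(random_agent, utility)
            for _ in range(n_samples)
        )
        return wins / n_samples
    return max(CANDIDATE_UTILITIES, key=lambda name: beat_rate(CANDIDATE_UTILITIES[name]))

# A paperclip-maximizing policy gets assigned "likes_paperclips": that is the
# candidate with respect to which it most reliably beats a random agent.
print(infer_utility(lambda: "a"))
```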
What do you mean by “follow a utility function”? Why do you think humans don’t do it?
Humans are neither independent nor transitive. Human preferences change over time, depending on arbitrary factors, including how choices are framed. Humans suffer because of things they cannot affect, and humans suffer because of details of their probability assessment (eg ambiguity aversion). That bears repeating—humans have preference over their state of knowledge. The core of this is that “assessment of fact” and “values” are not disconnected in humans, not disconnected at all. Humans feel good when a team they support wins, without them contributing anything to the victory. They will accept false compliments, and can be flattered. Social pressure changes most values quite easily.
Need I go on?
If it isn’t there, what does it mean to have a correct solution to the FAI problem?
A utility function which, if implemented by the AI, would result in a positive, fulfilling, worthwhile existence for humans. Even if humans had a utility, it’s not clear that a ruling FAI should have the same one, incidentally. The utility is for the AI, and it aims to capture as much of human value as possible—it might just be the utility of a nanny AI (make reasonable efforts to keep humanity from developing dangerous AIs, going extinct, or regressing technologically, otherwise, let them be).
What do you mean by “follow a utility function”? Why do you think humans don’t do it?
Humans are neither independent nor transitive…
You still haven’t defined “follow a utility function”. Humans are not ideal rational optimizers of their respective utility functions. It doesn’t mean they don’t have them. Deep Blue often plays moves which are not ideal; nevertheless, I think it’s fair to say it optimizes winning. If you make intransitive choices, it doesn’t mean your terminal values are intransitive. It means your choices are not optimal.
Human preferences change over time...
This is probably the case. However, the changes are slow, otherwise humans wouldn’t behave coherently at all. The human utility function is only defined approximately, but the FAI problem only makes sense in the same approximation. In any case, if you’re programming an AI you should equip it with the utility function you have at that moment.
...humans have preference over their state of knowledge...
Why do you think it is inconsistent with having a utility function?
...what does it mean to have a correct solution to the FAI problem?
A utility function which, if implemented by the AI, would result in a positive, fulfilling, worthwhile existence for humans.
How can you know that a given utility function has this property? How do you know the utility function I’m proposing doesn’t have this property?
Even if humans had a utility, it’s not clear that a ruling FAI should have the same one, incidentally.
Isn’t it? Assume your utility function is U. Suppose you have the choice to create a superintelligence optimizing U or a superintelligence optimizing something other than U, let’s say V. Why would you choose V? Choosing U will obviously result in an enormous expected increase of U, which is what you want to happen, since you’re a U-maximizing agent. Choosing V will almost certainly result in a lower expectation value of U: if the V-AI chooses a strategy X that leads to higher expected U than the strategy that would be chosen by a U-AI, then it’s not clear why the U-AI wouldn’t choose X.
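A toy numerical version of this argument (my own illustration, with made-up payoffs): whatever strategy a V-maximizing AI would pick, the U-maximizing AI is free to pick it too, so delegating to the U-maximizer can never give you lower expected U.

```python
# Made-up expected payoffs for a finite strategy set (pure illustration).
EXPECTED_PAYOFFS = {
    # strategy: (expected U, expected V)
    "optimize_for_U": (10.0, 1.0),
    "optimize_for_V": (2.0, 9.0),
    "do_nothing":     (0.0, 0.0),
}

def best_strategy(which):
    """The strategy an AI maximizing utility number `which` would choose."""
    return max(EXPECTED_PAYOFFS, key=lambda s: EXPECTED_PAYOFFS[s][which])

u_ai_choice = best_strategy(0)  # what the U-maximizing AI does
v_ai_choice = best_strategy(1)  # what the V-maximizing AI does

# Score both outcomes by *your* utility U (component 0):
u_if_you_build_u_ai = EXPECTED_PAYOFFS[u_ai_choice][0]
u_if_you_build_v_ai = EXPECTED_PAYOFFS[v_ai_choice][0]

# Holds for any payoff table: the U-AI optimizes exactly the quantity scored.
assert u_if_you_build_u_ai >= u_if_you_build_v_ai
print(u_if_you_build_u_ai, u_if_you_build_v_ai)  # 10.0 2.0
```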
Humans are not ideal rational optimizers of their respective utility functions.
Then why claim that they have one? If humans have intransitive preferences (A>B>C>A), as I often do, then why claim that actually their preferences are secretly transitive but they fail to act on them properly? Nothing we know about the brain points to there being a hidden box with a pristine and pure utility function, that we then implement poorly.
...humans have preference over their state of knowledge...
Why do you think it is inconsistent with having a utility function?
They have preferences like ambiguity aversion, eg being willing to pay to find out, during a holiday, whether they were accepted for a job, while knowing that they can’t make any relevant decisions with that early knowledge. This is not compatible with following a standard utility function.
I don’t know what you mean by a “standard” utility function. I don’t even know what you mean by “following”. We want to find out because uncertainty makes us nervous, being nervous is unpleasant, and pleasure is a terminal value. This is entirely consistent with having a utility function, and with my formalism in particular.
Humans are not ideal rational optimizers of their respective utility functions.
Then why claim that they have one? If humans have intransitive preferences (A>B>C>A), as I often do, then why claim that actually their preferences are secretly transitive but they fail to act on them properly?
In what epistemology are you asking this question? That is, what is the criterion according to which the validity of an answer would be determined?
If you don’t think human preferences are “secretly transitive”, then why do you suggest the following:
Whenever revealed preferences are non-transitive or non-independent, use the person’s stated meta-preferences to remove the issue. The AI thus calculates what the person would say if asked to resolve the transitivity or independence (for people who don’t know about the importance of resolving them, the AI would present them with a set of transitive and independent preferences, derived from their revealed preferences, and have them choose among them).
What is the meaning of asking a person to resolve intransitivities if there are no transitive preferences underneath?
That is, what is the criterion according to which the validity of an answer would be determined?
Those are questions for you, not for me. You’re claiming that humans have a hidden utility function. What do you mean by that, and what evidence do you have for your position?
I’m claiming that it is possible to define the utility function of any agent. For unintelligent “agents” the result is probably unstable. For intelligent agents the result should be stable.
The evidence is that I have a formalism which produces this definition in a way compatible with intuition about “agent having a utility function”. I cannot present evidence which doesn’t rely on intuition since that would require having another more fundamental definition of “agent having a utility function” (which AFAIK might not exist). I do not consider this to be a problem since all reasoning falls back to intuition if you ask “why” sufficiently many times.
I don’t see any meaningful definition of intelligence or instrumental rationality without a utility function. If we accept that humans are (approximately) rational / intelligent, they must (in the same approximation) have utility functions.
It also seems to me (again, intuitively) that the very concept of “preference” is incompatible with e.g. intransitivity. In the approximation in which it makes sense to speak of “preferences” at all, it makes sense to speak of preferences compatible with the VNM axioms, ergo a utility function. The same goes for the concept of “should”. If it makes sense to say one “should” do something (for example build a FAI), there must be a utility function according to which she should do it.
Bottom line, eventually it all hits philosophical assumptions which have no further formal justification. However, this is true of all reasoning. IMO the only valid method to disprove such assumptions is either by reductio ad absurdum or by presenting a different set of assumptions which is better in some sense. If you have such an alternative set of assumptions for this case, or a wholly different way to resolve philosophical questions, I would be very interested to know.
I’m claiming that it is possible to define the utility function of any agent.
It is trivially possible to do that. Since no choice is strictly identical, you just add enough details to make each choice unique, and then choose a utility function that will always reach that choice (“subject has a strong preference for putting his left foot forwards when seeing an advertisement for deodorant on Tuesday mornings that are the birthdays of prominent Dutch politicians”).
A good simple model of human behaviour is that of different modules expressing preferences and short-circuiting the decision making in some circumstances, and a more rational system (“system 2”) occasionally intervening to prevent loss through money pumps. So people are transitive in their ultimate decisions, often and to some extent, but their actual decisions depend strongly on which choices are presented first (ie their low level preferences are intransitive, but the rational part of them prevents loops). Would you say these beings have no preferences?
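The “add enough details to make each choice unique” trick two paragraphs up can be made concrete with a small sketch (my own illustration): given any behaviour log, define a utility that is 1 exactly on the (context, action) pairs that actually occurred, and the agent comes out as a perfect maximizer no matter what it did.

```python
def rationalizing_utility(behaviour_log):
    """Build a 'utility function' that perfectly rationalizes any behaviour.

    behaviour_log: (context, chosen_action) pairs, where the context is
    described in enough detail to make every decision situation unique."""
    observed = set(behaviour_log)

    def utility(context, action):
        # 1 for exactly what the agent did in that context, 0 for anything else.
        return 1.0 if (context, action) in observed else 0.0

    return utility

# An apparently intransitive choice history...
log = [("offered A or B", "A"), ("offered B or C", "B"), ("offered C or A", "C")]
u = rationalizing_utility(log)
# ...is rendered perfectly 'utility-maximizing' by this degenerate construction.
assert all(u(ctx, act) == 1.0 for ctx, act in log)
```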
I’m claiming that it is possible to define the utility function of any agent.
It is trivially possible to do that. Since no choice is strictly identical, you just add enough details to make each choice unique, and then choose a utility function that will always reach that choice
My formalism doesn’t work like that, since the utility function is a function over possible universes, not over possible choices. There is no trivial way to construct a utility function wrt which the given agent’s intelligence is close to maximal. However, it still might be the case that we need to give larger weight to simple utility functions (otherwise we’re left with selecting a maximum in an infinite set, and it’s not clear why it exists). As I said, I don’t have the final formula.
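One way the weighting toward simple utility functions could look (purely my own guess at a possible form, not a worked-out part of the formalism) is to trade the agent’s apparent intelligence off against the description length of the candidate, e.g. score(U) = I(A, U) − λ·K(U), with K approximated by something crude like the length of a serialized representation.

```python
import json

def description_length(utility_table):
    """Crude proxy for the complexity K(U): length of a canonical serialization.
    (A real version would use program length in a fixed formal language.)"""
    return len(json.dumps(utility_table, sort_keys=True))

def score(apparent_intelligence, utility_table, lam=0.01):
    """Trade apparent intelligence under U off against the complexity of U."""
    return apparent_intelligence - lam * description_length(utility_table)

# Hypothetical numbers: a simple candidate that explains the agent slightly
# worse can still beat an ad-hoc candidate that is vastly more complex.
simple_u  = {"paperclips": 1.0}
complex_u = {f"world_state_{i}": i * 0.001 for i in range(1000)}
print(score(0.66, simple_u) > score(0.67, complex_u))  # True
```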
A good simple model of human behaviour is that of different modules expressing preferences and short-circuiting the decision making in some circumstances, and a more rational system (“system 2”) occasionally intervening to prevent loss through money pumps. So people are transitive in their ultimate decisions, often and to some extent, but their actual decisions depend strongly on which choices are presented first (ie their low level preferences are intransitive, but the rational part of them prevents loops). Would you say these beings have no preferences?
I’d say they have a utility function. Imagine a chess AI that selects moves by one of two strategies. The first strategy (“system 1”) uses simple heuristics like “check when you can” that produce an answer quickly and save precious time. The second strategy (“system 2”) runs a minimax algorithm with a 10-move-deep search tree. Are all of the agent’s decisions perfectly rational? No. Does it have a utility function? Yes: winning the game.
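The chess example can be shrunk to something runnable. A toy sketch (my construction, using a Nim-like game instead of chess to keep it short): one agent, two move-selection procedures, a cheap heuristic and an exact search, both in the service of a single fixed utility, namely winning.

```python
from functools import lru_cache

# Nim-like game: players alternately take 1-3 stones; whoever takes the last
# stone wins. The agent's utility function is fixed throughout: winning.

def system1_move(pile):
    """Cheap heuristic (the 'check when you can' analogue): grab the maximum."""
    return min(3, pile)

@lru_cache(maxsize=None)
def winning(pile):
    """True if the player to move can force a win from this pile size."""
    return any(not winning(pile - take) for take in (1, 2, 3) if take <= pile)

def system2_move(pile):
    """Exact search: pick a move that leaves the opponent in a losing position."""
    for take in (1, 2, 3):
        if take <= pile and not winning(pile - take):
            return take
    return 1  # no winning move exists; play anything

def choose_move(pile, time_is_scarce=True):
    """One agent, one utility (winning); which procedure runs depends on budget."""
    return system1_move(pile) if time_is_scarce and pile > 12 else system2_move(pile)

# Not every decision is optimal: with 13 stones system 1 takes 3 (leaving 10),
# while the winning move is to take 1 (leaving 12). It is still fair to say
# the agent as a whole optimizes "win the game".
print(choose_move(13), choose_move(7))  # 3 3
```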
There are many such operators, and different ones give different answers when presented with the same agent. Only a human utility function distinguishes the right way of interpreting a human mind as having a utility function from all of the wrong ways of interpreting a human mind as having a utility function. So you need to get a bunch of Friendliness Theory right before you can bootstrap.
Why do you think there are many such operators? Do you believe the concept of “utility function of an agent” is ill-defined (assuming the “agent” is actually an intelligent agent rather than e.g. a rock)? Do you think it is possible to interpret a paperclip maximizer as having a utility function other than maximizing paperclips?
Deducing the correct utility of a utility maximiser is one thing (which has a low level of uncertainty, higher if the agent is hiding stuff). Assigning a utility to an agent that doesn’t have one is quite another.