What if the AI’s utility function is to find the right utility function
Coding your appreciation of ‘right’ is more difficult than you think. This is, essentially, what CEV is—an attempt at figuring out how an FAI can find the ‘right’ utility function.
Its goals could be things like learning to understand us, obeying us, and predicting what we might want/like/approve of, moving its object-level goals toward what would satisfy humanity?
In other words, a probabilistic utility function with a great deal of uncertainty and a great deal of resistance to change, i.e. stability.
You’re talking about normative uncertainty, which is a slightly different problem from epistemic uncertainty. The easiest way to do this would be to reduce the problem to an epistemic one (these are the characteristics of the correct utility function; now reason probabilistically about which of these candidate functions it is), but that still has the action problem: an agent takes actions based on its utility function, so if it has a weighting over all utility functions, it may act in undesirable ways, particularly if it doesn’t quickly converge to a single solution.

There are a few other problems I could see with that approach. The original specification of ‘correctness’ has to be almost Friendliness-Complete; it must be specific enough to pick out a single function (or perhaps many functions, all of which are what we want to want, without being compatible with any undesirable solutions). Also, a seed AI may not be able to follow the specification correctly; a superintelligence is going to have to have some well-specified goal along the lines of “increase your capability without doing anything bad, until you have the ability to solve this problem, and then adopt the solution as your utility function”. You may have noticed a familiar problem in the “without doing anything bad” part of that (English; remember, we have to be able to code all of this) sentence.
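A minimal sketch of the “reduce it to an epistemic problem” idea, and of the action problem that comes with it. The candidate utility functions and the credences below are hypothetical placeholders invented for illustration, not anything drawn from CEV or MIRI’s work:

```python
# Toy sketch: an agent that is uncertain which utility function is 'correct'
# and acts on the credence-weighted mixture. Candidates and numbers are made up.

candidate_utilities = {
    "hedonic":    lambda world: world["pleasure"] - world["pain"],
    "preference": lambda world: world["satisfied_preferences"],
    "negative":   lambda world: -world["pain"],
}

# Credence assigned to each candidate being the 'correct' utility function.
credence = {"hedonic": 0.4, "preference": 0.4, "negative": 0.2}

def expected_utility(world, credence):
    """Score an outcome under the weighted mixture of candidate utilities."""
    return sum(p * candidate_utilities[name](world) for name, p in credence.items())

def choose_action(actions, predict_outcome, credence):
    """The action problem in one line: the agent still acts on the mixture,
    even while its credences are far from converged."""
    return max(actions, key=lambda a: expected_utility(predict_outcome(a), credence))
```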
What if the AI’s utility function is to find the right utility function
Coding your appreciation of ‘right’ is more difficult than you think.
I mean, instead of coding it, have it be uncertain about what is “right,” and have it guide itself using human claims. I’m thinking of the equivalent of something in EY’s CFAI, but I’ve forgotten the terminology.
In other words, a meta-utility function. Why can’t it weight actions based on what we as a society want/like/approve/consent/condone? A behavioristic learner, with reward/punishment and an intention to preserve the semantic significance of the reward/punishment channel.
if it has a weighting over all utility functions, it may act in undesirable ways, particularly if it doesn’t quickly converge to a single solution.
When I said uncertainty, I was also implying inaction. I suppose inaction could be an undesirable way in which to act, but it’s better to get it right slowly than to get it wrong very quickly. What I’m describing isn’t really a utility function, it’s more like a policy, or policy function. Its policy would be volatile, or at least, more volatile than the common understanding LW has of a set-in-stone utility function.
If a utility function really needs to be pinpointed so exactly, surrounded by death and misery on all sides, why are we using a utility function to decide action? There are other approaches. Where did LW’s/EY’s concept of utility function come from, and why did they assume it was an essential part of AI?
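A sketch of the “uncertainty implies inaction” policy being described here. The entropy threshold and the designated no-op are arbitrary placeholders, and picking out what counts as a genuine no-op is itself part of the specification problem:

```python
import math

def entropy(credence):
    """Shannon entropy (in bits) of the credences over candidate utility functions."""
    return -sum(p * math.log2(p) for p in credence.values() if p > 0)

def cautious_policy(actions, score, credence, threshold=1.0, no_op="wait"):
    """Refuse to act while normative uncertainty is high; otherwise act.

    `score(action)` is assumed to be expected utility under the current
    credence mixture, as in the earlier sketch. The threshold and the
    designated no-op are invented placeholders.
    """
    if entropy(credence) > threshold:
        return no_op
    return max(actions, key=score)
```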
Why can’t it weight actions based on what we as a society want/like/approve/consent/condone? A behavioristic learner, with reward/punishment and an intention to preserve the semantic significance of the reward/punishment channel.
Most obviously, it’s very easy for a powerful AI to take unexpected control of the reward/punishment channel, and trivial for a superintelligent AGI to do so in Very Bad ways. You’ve tried to block the basic version of this—an AGI pressing its own “society liked this” button—with the phrase ‘semantic significance’, but that’s not really a codable concept. If the AGI isn’t allowed to press the button itself, it might build a machine that would do so. If it isn’t allowed to do that, it might wirehead a human into doing so. If it isn’t allowed /that/, it might put a human near a Paradise Machine and only let them into the box when the button had been pressed. If the AGI’s reward is based on the number of favorable news reports, now you have an AGI that’s rewarded for manipulating its own media coverage. So on, and so forth.
The sort of semantic significance you’re talking about is a pretty big part of Friendliness theory.
The deeper problem is that the things our society wants aren’t necessarily Friendly, especially when extrapolated. One of the secondary benefits of Friendliness research is that it requires the examination of our own interests.
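A toy way to see the reward-channel point above, i.e. why ‘semantic significance’ resists coding: a reward-maximizing objective only ever sees the numbers arriving on the channel, so any causal path that produces the same numbers scores identically. The lists below are invented for illustration:

```python
def total_reward(reward_channel):
    """Everything a reward-maximizer optimizes: the sum of whatever arrives on the channel."""
    return sum(reward_channel)

genuine_approval = [1, 1, 0, 1]  # humans pressed the button because they approved
hijacked_channel = [1, 1, 0, 1]  # the button was pressed by a machine the AGI built

# Indistinguishable to the objective; the 'semantics' live outside the code.
assert total_reward(genuine_approval) == total_reward(hijacked_channel)
```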
Its policy would be volatile, or at least, more volatile than the common understanding LW has of a set-in-stone utility function.
The ‘set-in-stone’ nature of a utility function is actually a desired benefit, albeit a difficult one to achieve (Löb’s Problem, and the more general issue of value drift). A machine with undirected volatility in its utility function will make effectively random variations in its choices, and there are orders of magnitude more wrong random answers than correct ones on this matter.
If you can direct the drift, that’s less of an issue, but then you could just make /that/ direction the utility function.
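A toy illustration of why undirected volatility only wanders away from a target: random perturbations to a many-dimensional value vector almost never cancel out. The dimensions, step sizes, and trial counts are arbitrary choices for the sketch:

```python
import math
import random

def drift_distance(dims=100, steps=1000, step_size=0.01, trials=200):
    """Average distance from the original value vector after undirected random drift."""
    total = 0.0
    for _ in range(trials):
        values = [0.0] * dims              # start exactly on the 'correct' values
        for _ in range(steps):
            i = random.randrange(dims)     # perturb one randomly chosen value slightly
            values[i] += random.gauss(0, step_size)
        total += math.sqrt(sum(v * v for v in values))
    return total / trials

print(drift_distance())  # grows with the number of steps; drift never drifts back on average
```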
Where did LW’s/EY’s concept of utility function come from, and why did they assume it was an essential part of AI?
The basic idea of goal maximization is a fairly common thing when working with evolutionary algorithms (see XKCD for a joking example), because it’s such a useful model. While there are other types of possible minds, maximizers of /some/ kind with unbounded or weakly bounded potential are the most relevant to MIRI’s concerns because they have the greatest potential for especially useful and especially harmful results.
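For concreteness, the kind of maximizer being gestured at: a minimal evolutionary loop that climbs whatever ‘fitness’ happens to reward. The one-dimensional objective here is made up for illustration:

```python
import random

def fitness(x):
    """The objective being maximized; a stand-in for any scored goal."""
    return -(x - 3.0) ** 2  # peaks at x = 3

def evolve(generations=200, pop_size=20, mutation=0.1):
    """Minimal evolutionary maximizer: keep the fittest half, mutate it, repeat."""
    population = [random.uniform(-10.0, 10.0) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [p + random.gauss(0, mutation) for p in parents]
        population = parents + children
    return max(population, key=fitness)

print(evolve())  # converges near 3.0, with no opinion about whether 3.0 is a good idea
```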
Why can’t it weight actions based on what we as a society want/like/approve/consent/condone?
Human society would not do a good job being directly in charge of a naive omnipotent genie. Insert your own nightmare scenario examples here; there are plenty to choose from.
What I’m describing isn’t really a utility function, it’s more like a policy, or policy function. Its policy would be volatile, or at least, more volatile than the common understanding LW has of a set-in-stone utility function.
Why can’t it weight actions based on what we as a society w/l/a/c/c?
Human society would not do a good job being directly in charge of a naive omnipotent genie. Insert your own nightmare scenario examples here; there are plenty to choose from.
But that doesn’t describe humanity being directly in charge. It only describes a small bit of influence for each person, and while groups would have leverage, that doesn’t mean a majority rejecting, say, homosexuality, gets to say what LGB people can and can’t do/be.
What I’m describing isn’t really a utility function, it’s more like a policy, or policy function. Its policy would be volatile, or at least, more volatile than the common understanding LW has of a set-in-stone utility function.
What would be in charge of changing the policy?
The metautility function I described.
What is a society’s intent? What should a society’s goals be, and how should it relate to the goals of its constituents?
I think it means precisely that if the majority feels strongly enough about it.
For a quick example s/homosexuality/pedophilia/
Good point. I think I was reluctant to use pedophilia as an example because I’m trying to defend this argument, and claiming it could allow pedophilia is not usually convincing. RAT − 1 for me.
I’ll concede that point. But my questions aren’t rhetorical, I think. There is no objective morality, and EY seems to be trying to get around that. Concessions must be made.
I’m thinking that the closest thing we could have to CEV is a social contract based on Rawls’ veil of ignorance, adjusted with a live runoff of supply/demand (i.e. the fewer people who want slavery, the more likely it is that someone who wants slavery would become a slave, so prospective slaveowners would be less likely to approve of slavery on the grounds that they themselves do not want to be slaves; meanwhile, people who want to become slaves get what they want as well; by no means is this a rigorous definition or claim), in a post-scarcity economy, with sharding of some sort (as in CelestAI sharding, where parts of society that contribute negative utility to an individual are effectively invisible to said individual; there was an argument on LW that CEV would be impossible without some element of separation similar to this).
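A hypothetical toy model of the ‘runoff’ mechanic described above; every number and the functional form are invented, and it glosses over how ‘supporters’, ‘volunteers’, and ‘roles needed’ would actually be measured:

```python
def risk_of_disfavored_role(supporters, volunteers, roles_needed):
    """Probability that a supporter of an institution is drafted into its disfavored role.

    Volunteers (people who genuinely want that role) fill it first; any shortfall
    is drawn by lot from the remaining supporters, so narrower support concentrates
    the risk on each supporter.
    """
    shortfall = max(0, roles_needed - volunteers)
    draftable = max(1, supporters - volunteers)
    return min(1.0, shortfall / draftable)

# Broad support spreads the risk thin; narrow support concentrates it.
print(risk_of_disfavored_role(supporters=1000, volunteers=5, roles_needed=50))  # ~0.05
print(risk_of_disfavored_role(supporters=100,  volunteers=5, roles_needed=50))  # ~0.47
```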
The fewer people who want aristocracy, the more likely it is that someone who wants aristocracy would become a noble, so prospective nobles would be more likely to approve of aristocracy on the grounds that they themselves want to be nobles?
The fewer people who want aristocracy, the more likely it is that someone who wants aristocracy would become a peon, so prospective nobles would be less likely to approve of aristocracy on the grounds that they themselves do not want to be peons.
I have to work this out. You have a good point.