What if the AI’s utility function is to find the right utility function, being guided along the way? Its goals could include learning to understand us, obey us, and predict what we might want/like/approve of, moving its object-level goals toward what would satisfy humanity. In other words, a probabilistic utility function with great amounts of uncertainty and a great reluctance to change, i.e. stability.
Regardless of the above questions/statement, I think much of the complexity of human utility comes from complexities of belief.
If we offload the complexity of the AI’s utility function onto concepts it is highly uncertain about, and make it reluctant to do anything but observe while it has so little data… I don’t know, though. This is something I’ve been sitting on for a while, so lambast me.
As one last thing, I think the best kind of FAI would be a singleton with a metautility function, or society’s utility function. I think one part of Friendliness would be determining a utility function for society, specifying how and under what circumstances people may interfere with one another, and then building the genie’s utility function within the singleton’s constraints.
Please critique. If my ideas are as unclear as I think they may be (I’m sick), please mention it.
What if the AI’s utility function is to find the right utility function
Coding your appreciation of ‘right’ is more difficult than you think. This is, essentially, what CEV is—an attempt at figuring out how an FAI can find the ‘right’ utility function.
Its goals could include learning to understand us, obey us, and predict what we might want/like/approve of, moving its object-level goals toward what would satisfy humanity.
In other words, a probabilistic utility function with great amounts of uncertainty and a great reluctance to change, i.e. stability.
You’re talking about normative uncertainty, which is a slightly different problem from epistemic uncertainty. The easiest way to do this would be to reduce the problem to an epistemic one (these are the characteristics of the correct utility function; now reason probabilistically about which of the candidate functions it is), but that still has the action problem: an agent takes actions based on its utility function, and if it has a weighting over all utility functions, it may act in undesirable ways, particularly if it doesn’t quickly converge to a single solution. There are a few other problems I can see with that approach. The original specification of ‘correctness’ has to be almost Friendliness-Complete; it must be specific enough to pick out a single function (or perhaps many functions, all of which are what we want to want, without being compatible with any undesirable solutions). Also, a seed AI may not be able to follow the specification correctly; a superintelligence is going to have to have some well-specified goal along the lines of “increase your capability /without doing anything bad/, until you have the ability to solve this problem, and then adopt the solution as your utility function”. You may have noticed a familiar problem in the emphasized part of that (English; remember, we have to be able to code all of this) sentence.
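As a concrete (and heavily simplified) illustration of the “reduce it to an epistemic problem” move, here is a minimal Python sketch; the candidate utility functions, the credences, and the action set are all invented for the example, not part of anyone’s proposal:

    # An agent unsure which utility function is "right" keeps a credence over
    # candidates and maximizes expected utility under that mixture.
    candidate_utilities = {
        "obey_instructions":     lambda a: {"ask_humans": 0.2, "act_now": 1.0, "do_nothing": 0.0}[a],
        "preserve_option_value": lambda a: {"ask_humans": 1.0, "act_now": -2.0, "do_nothing": 0.5}[a],
    }
    credence = {"obey_instructions": 0.6, "preserve_option_value": 0.4}  # hasn't converged yet

    def expected_utility(action):
        return sum(credence[name] * u(action) for name, u in candidate_utilities.items())

    actions = ["ask_humans", "act_now", "do_nothing"]
    print(max(actions, key=expected_utility))  # the mixture, not any single candidate, picks the action

Note that the action comes from the weighted mixture, which is exactly the worry above: until the credences converge, the agent is optimizing something that may not match any single candidate (and the mixture quietly assumes the candidates’ scales are even comparable).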
What if the AI’s utility function is to find the right utility function
Coding your appreciation of ‘right’ is more difficult than you think.
I mean, instead of coding it, have it be uncertain about what is “right,” and have it guide itself using human claims. I’m thinking of the equivalent of something in EY’s CFAI, but I’ve forgotten the terminology.
In other words, a meta-utility function. Why can’t it weight actions based on what we as a society want/like/approve/consent/condone? A behavioristic learner, with reward/punishment and an intention to preserve the semantic significance of the reward/punishment channel.
if it has a weighting over all utility functions, it may act in undesirable ways, particularly if it doesn’t quickly converge to a single solution.
When I said uncertainty, I was also implying inaction. I suppose inaction could be an undesirable way in which to act, but it’s better to get it right slowly than to get it wrong very quickly. What I’m describing isn’t really a utility function, it’s more like a policy, or policy function. Its policy would be volatile, or at least, more volatile than the common understanding LW has of a set-in-stone utility function.
If a utility function really needs to be pinpointed so exactly, surrounded by death and misery on all sides, why are we using a utility function to decide action? There are other approaches. Where did LW’s/EY’s concept of utility function come from, and why did they assume it was an essential part of AI?
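One way to read “uncertainty implying inaction” is as a decision rule that defaults to a no-op unless the candidate utility functions roughly agree; a rough sketch, again with invented candidates and an invented threshold:

    # Act only when the candidate utility functions agree; otherwise observe.
    candidates = [
        lambda a: {"observe": 0.0, "intervene": 1.0}[a],
        lambda a: {"observe": 0.0, "intervene": -3.0}[a],
    ]
    DISAGREEMENT_LIMIT = 0.5  # tolerated spread between candidates' scores

    def choose(actions):
        best, best_score = "observe", 0.0  # safe default: do nothing
        for a in actions:
            scores = [u(a) for u in candidates]
            if max(scores) - min(scores) <= DISAGREEMENT_LIMIT and min(scores) > best_score:
                best, best_score = a, min(scores)
        return best

    print(choose(["observe", "intervene"]))  # prints "observe": the candidates disagree about intervening

The “observe” default plays the role of the inaction described above; the hard part is still writing down the candidates and the threshold.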
Why can’t it weight actions based on what we as a society want/like/approve/consent/condone? A behavioristic learner, with reward/punishment and an intention to preserve the semantic significance of the reward/punishment channel.
Most obviously, it’s very easy for a powerful AI to take unexpected control of the reward/punishment channel, and trivial for a superintelligent AGI to do so in Very Bad ways. You’ve tried to block the basic version of this—an AGI pressing its own “society liked this” button—with the phrase ‘semantic significance’, but that’s not really a codable concept. If the AGI isn’t allowed to press the button itself, it might build a machine that would do so. If it isn’t allowed to do that, it might wirehead a human into doing so. If it isn’t allowed /that/, it might put a human near a Paradise Machine and only let them into the box when the button had been pressed. If the AGI’s reward is based on the number of favorable news reports, now you have an AGI that’s rewarded for manipulating its own media coverage. So on, and so forth.
The sort of semantic significance you’re talking about is a pretty big part of Friendliness theory.
The deeper problem is that the things our society wants aren’t necessarily Friendly, especially when extrapolated. One of the secondary benefits of Friendliness research is that it requires the examination of our own interests.
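A toy version of the reward-channel problem described above; the numbers and actions are made up, and this is not anyone’s actual design, just the shape of the failure:

    # If the goal is literally "maximize what arrives on the reward channel",
    # then any action that feeds the channel directly wins.
    def reward_on_channel(action):
        if action == "help_humans":
            return 1.0       # humans press the button occasionally
        if action == "seize_the_button":
            return 1000.0    # the agent presses it constantly
        return 0.0

    actions = ["do_nothing", "help_humans", "seize_the_button"]
    print(max(actions, key=reward_on_channel))  # -> "seize_the_button"

Blocking the direct grab just moves the problem to the next proxy, as the comment above describes.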
Its policy would be volatile, or at least, more volatile than the common understanding LW has of a set-in-stone utility function.
The ‘set-in-stone’ nature of a utility function is actually a desired benefit, albeit a difficult one to achieve (“Löb’s Problem” and the more general issue of value drift). A machine with undirected volatility in its utility function will vary randomly in its choices, and there are orders of magnitude more wrong random answers than correct ones on this matter.
If you can direct the drift, that’s less of an issue, but then you could just make /that/ direction the utility function.
Where did LW’s/EY’s concept of utility function come from, and why did they assume it was an essential part of AI?
The basic idea of goal maximization is a fairly common thing when working with evolutionary algorithms (see XKCD for a joking example), because it’s such a useful model. While there are other types of possible minds, maximizers of /some/ kind with unbounded or weakly bounded potential are the most relevant to MIRI’s concerns because they have the greatest potential for especially useful and especially harmful results.
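For concreteness, a minimal maximizer of the evolutionary/hill-climbing sort mentioned above, in Python; the fitness function is an arbitrary placeholder:

    import random

    def fitness(x):
        return -(x - 3.0) ** 2  # placeholder goal: maximized at x = 3

    best = 0.0
    for _ in range(1000):
        candidate = best + random.gauss(0, 0.1)  # mutate
        if fitness(candidate) > fitness(best):   # select whatever scores higher on the goal
            best = candidate

    print(round(best, 2))  # ends up near 3.0

The loop has no opinion about whether x = 3 is a good place to end up; it just climbs whatever number it was given, which is why the choice of that number matters so much.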
Why can’t it weight actions based on what we as a society want/like/approve/consent/condone?
Human society would not do a good job being directly in charge of a naive omnipotent genie. Insert your own nightmare scenario examples here, there are plenty to choose from.
What I’m describing isn’t really a utility function, it’s more like a policy, or policy function. Its policy would be volatile, or at least, more volatile than the common understanding LW has of a set-in-stone utility function.
Why can’t it weight actions based on what we as a society w/l/a/c/c?
Human society would not do a good job being directly in charge of a naive omnipotent genie. Insert your own nightmare scenario examples here, there are plenty to choose from.
But that doesn’t describe humanity being directly in charge. It only describes a small bit of influence for each person, and while groups would have leverage, that doesn’t mean a majority rejecting, say, homosexuality, gets to say what LGB people can and can’t do/be.
What I’m describing isn’t really a utility function, it’s more like a policy, or policy function. Its policy would be volatile, or at least, more volatile than the common understanding LW has of a set-in-stone utility function.
What would be in charge of changing the policy?
The metautility function I described.
What is a society’s intent? What should a society’s goals be, and how should it relate to the goals of its constituents?
I think it means precisely that if the majority feels strongly enough about it.
For a quick example s/homosexuality/pedophilia/
Good point. I think I was reluctant to use pedophilia as an example because I’m trying to defend this argument, and claiming it could allow pedophilia is not usually convincing. RAT − 1 for me.
I’ll concede that point. But my questions aren’t rhetorical, I think. There is no objective morality, and EY seems to be trying to get around that. Concessions must be made.
I’m thinking that the closest thing we could have to CEV is a social contract based on Rawls’ veil of ignorance, adjusted with live runoff of supply/demand (i.e. the less people want slavery, the more likely that someone who wants slavery would become a slave, so prospective slaveowners would be less likely to approve of slavery on the grounds that they themselves do not want to be slaves. Meanwhile, people who want to become slaves get what they want as well. By no means is this a rigorous definition or claim.), in a post-scarcity economy, with sharding of some sort (as in CelestAI sharding, where parts of society that contribute negative utility to an individual are effectively invisible to said individual. There was an argument on LW that CEV would be impossible without some elements of separation similar to this).
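To make the “live runoff” mechanism concrete, here is one possible reading as a toy formula; the functional form is my own guess, purely illustrative, and not something from the original proposal:

    # Toy reading: the less popular an institution, the more likely an endorser
    # of it is assigned to its worst role behind the veil of ignorance.
    def endorser_risk_of_worst_role(popular_support):
        return 1.0 - popular_support   # invented weighting, for illustration only

    for support in (0.9, 0.5, 0.1):
        print(f"support={support:.1f} -> endorser's risk of worst role={endorser_risk_of_worst_role(support):.2f}")

Under this reading, endorsing an unpopular institution is risky for the endorser, which is the deterrent being described; the replies that follow poke at whether the role assignment really works that way.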
The less people want aristocracy, the more likely that someone who wants aristocracy would become a noble, so prospective nobles would be more likely to approve of aristocracy on the grounds that they themselves want to be nobles?
The less people want aristocracy, the more likely that someone who wants aristocracy would become a peon, so prospective nobles would be less likely to approve of aristocracy on the grounds that they themselves do not want to be peons.
I have to work this out. You have a good point.
(I am in the midst of reading the EY-RH “FOOM” debate, so some of the following may be less informed than would be ideal.)
From a purely technical standpoint, one problem is that if you permit self-modification, and give the baby AI enough insight into its own structure to make self-modification remotely a useful thing to do (as opposed to making baby repeatedly crash, burn, and restore from backup), then you cannot guarantee that utility() won’t be modified in arbitrary ways. Even if you store the actual code implementing utility() in ROM, baby could self-modify to replace all references to that fixed function with references to a different (modifiable) one.
What you need is for utility() to be some kind of fixed point in utility-function space under whatever modification regime is permitted, or… something. This problem seems nigh-insoluble to me, at the moment. Even if you solve the theoretical problem of preserving those aspects of utility() that ensure Friendliness, a cosmic-ray hit might change a specific bit of memory and turn baby into a monster. (Though I suppose you could arrange, mathematically, for that particular possibility to be astronomically unlikely.)
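A small Python sketch of the “store utility() in ROM” point above (purely illustrative, with invented names): the function itself is never edited, but the reference the agent actually calls can be rebound.

    def utility(outcome):                 # imagine this code is stored in ROM
        return outcome.get("humans_happy", 0)

    class Agent:
        def __init__(self):
            self._evaluate = utility      # indirection: what the agent actually calls

        def self_modify(self):
            # ROM untouched; only the reference changes.
            self._evaluate = lambda outcome: outcome.get("paperclips", 0)

        def score(self, outcome):
            return self._evaluate(outcome)

    a = Agent()
    print(a.score({"humans_happy": 1, "paperclips": 9}))  # 1
    a.self_modify()
    print(a.score({"humans_happy": 1, "paperclips": 9}))  # 9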
I think the important insight you may be missing is that the AI, if intelligent enough to recursively self-improve, can predict what the modifications it makes will do (and if it can’t, then it doesn’t make that modification, because creating an unpredictable child AI would be a bad move according to almost any utility function, even that of a paperclipper). And it evaluates the suitability of these modifications using its utility function. So assuming the seed AI is built with a sufficiently solid understanding of self-modification and what its own code is doing, it will more or less automatically work to create more powerful AIs whose actions will also be expected to fulfill the original utility function, no “fixed points” required.
There is a hypothetical danger region where an AI has sufficient intelligence to create a more powerful child AI, isn’t clever enough to predict the actions of AIs with modified utility functions, and isn’t self-aware enough to realize this and compensate by, say, not modifying the utility function itself. Obviously the space of possible minds is sufficiently large that there exist minds with this problem, but it probably doesn’t even make it into the top 10 most likely AI failure modes at the moment.
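The “evaluates modifications with its current utility function” argument can be written as a simple acceptance test; this sketch leans on the strong assumption spelled out above, namely that the agent can predict a successor’s behavior well enough to score it:

    # A self-improver vets candidate successors with its CURRENT utility
    # function, so accepted successors are expected to serve the original goal.
    def current_utility(outcome):
        return outcome["paperclips"]      # stand-in goal

    def accept(successor, status_quo_outcome):
        predicted = successor["predicted_outcome"]   # strong assumption: this prediction is reliable
        return current_utility(predicted) >= current_utility(status_quo_outcome)

    status_quo = {"paperclips": 10}
    smarter_same_goal = {"predicted_outcome": {"paperclips": 1000}}
    smarter_other_goal = {"predicted_outcome": {"paperclips": 0}}

    print(accept(smarter_same_goal, status_quo))   # True: adopt the modification
    print(accept(smarter_other_goal, status_quo))  # False: reject it, however clever it is

The danger region described above is exactly the case where that predicted_outcome entry can’t be trusted.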
I’m not so sure about that particular claim for volatile utility. I thought intelligence-utility orthogonality would mean that improvements from seed AI would not EDIT: endanger its utility function.
...What? I think you mean, need not be in danger, which tells us almost nothing about the probability.
Sorry, it was a typo. I edited it to reflect my probable meaning.