Hi all, I’m new here, so pardon me if I speak nonsense. I have some thoughts about how and why an AI would want to trick us or mislead us, for instance by behaving nicely during tests and turning nasty when released, and it would be great if I could be pointed in the right direction.
So here’s my thought process.
Our AI is a utility-based agent that wishes to maximize the total utility of the world, based on a utility function that has been coded by us with some initial values and has then evolved through reinforcement learning. With our usual luck, somehow it’s learnt that paperclips are a bit more useful than humans. Now the “treacherous turn” problem that I’ve read about says that we can’t trust the AI just because it performs well under surveillance, because it might have calculated that it’s better to play nice until it acquires more power, and only then turn all humans into paperclips. I’d like to understand more about this process.
Say it calculates that the world with maximum utility is one where it can turn us all into paperclips with minimum effort, with the total utility of this world being U_AI(kill) = 100. Second best is a world where it first plays nice until it is unstoppable, then turns us into paperclips. This is second best because it’s wasting time and resources to achieve the same final result: U_AI(nice+kill) = 99.
Why would it possibly choose the second, sub-optimal option, which is the most dangerous for us? I suppose it would only choose it if it associated it with a higher probability of success, which means somehow, somewhere the AI must have calculated that the utility a human would give to these scenarios is different from what it is giving, otherwise we would be happy to comply. In particular, it must believe that for each possible world w:
if U_AI(kill) ≥ U_AI(w) ≥ U_AI(nice+kill), then U_human(w) ≤ U_human(nice+kill)
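A minimal numerical sketch of the reasoning above (the success probabilities are entirely made up for illustration): once probability of success is factored in, the plan with the lower raw utility can come out ahead in expectation.

```python
# Toy numbers, purely for illustration: the AI compares plans by expected
# utility, so the "play nice first" plan can win even though its raw utility
# is slightly lower.

U_AI = {"kill": 100, "nice+kill": 99}

# Assumed probabilities of success: attacking while still weak and under
# surveillance is risky; playing nice until unstoppable almost always works.
p_success = {"kill": 0.2, "nice+kill": 0.95}

for plan in U_AI:
    print(f"{plan}: expected utility {p_success[plan] * U_AI[plan]:.2f}")

# kill:      expected utility 20.00
# nice+kill: expected utility 94.05  (the treacherous plan wins in expectation)
```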
How is the AI calculating utilities from a human point of view? (Sorry, but this question comes straight out of my poor understanding of AI architectures.) Is it using some kind of secondary utility function that it applies to humans to guess their behavior? If the process that would motivate the AI to trick us is anything like this, then it looks to me like it could be solved by making the AI use exactly its own utility function when it refers to other agents. Also note that the utilities must not be relative to the agent, but to the AI. For instance, if the AI greatly values its own survival over the survival of other agents, then the other agents should equally greatly value the AI’s survival over their own. This should be easily achieved if, whenever the AI needs to look up another agent’s utility for any action, it is simply redirected to its own.
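Something like this toy sketch is roughly what I have in mind; the class and names are just mine for illustration, not any real architecture.

```python
# Toy sketch of the "redirection" idea: whenever the agent needs the utility
# that some other agent assigns to a world, the lookup goes straight to the
# agent's own utility function, so no separate model of human values exists.

class SharedUtilityAgent:
    def __init__(self, utility_fn):
        self.utility_fn = utility_fn  # the AI's own utility over world states

    def my_utility(self, world):
        return self.utility_fn(world)

    def utility_of(self, other_agent, world):
        # Redirection clause: every other agent is assumed to value worlds
        # exactly as this agent does.
        return self.utility_fn(world)

# The AI therefore predicts that humans rank its plans exactly as it does,
# so deceiving them never looks worth the extra effort.
clippy = SharedUtilityAgent(lambda world: world.get("paperclips", 0))
print(clippy.utility_of("human", {"paperclips": 1000}))  # 1000, same as clippy.my_utility(...)
```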
This way the AI will always think we would love its optimal plan, and would never see the need to lie to us, trick us, brainwash us or engineer us in any way, as that would only be a waste of resources. In some cases it might even openly look for our collaboration if that makes the plan any better. Clippy, for instance, might say: “OK guys, I’m going to turn everything into paperclips. Can you please quickly get me the resources I need to begin with, then you can all line up over there for paperclippification. Shall we start?”
This also seems to make the AI indifferent to our actions, provided its belief that our utility function is identical to its own is unchangeable. For instance, even while it sees us pressing the button to blow it up, it won’t think we are going to jeopardize the plan. That would be crazy. Nor will it try to stop us from rebooting it: since it can’t imagine us not going along with the plan from that moment onward, it’s never a good choice to waste time and resources stopping us. There’s no need to.
Now obviously this does not solve the problem of how to make it do the right thing, but it looks to me that at least we would be able to assume that behavior observed during tests is honest. What am I getting wrong? (Don’t flame me please!!!)
Hi all, thanks for taking the time to comment. I’m sure it must be a bit frustrating to read something that lacks technical terms as much as this post does, so I really appreciate your input. I’ll just write a couple of lines to summarize my thought, which is to design an AI that:
1- uses an initial utility function U, defined in absolute terms rather than subjective terms (for instance “survival of the AI” rather than “my survival”);
2- doesn’t try to learn a utility function for humans or for other agents, but uses for everyone the same utility function U it uses for itself;
3- updates this utility function when things don’t go to plan, so that it improves its predictions.
Is such a design technically feasible? Am I right in thinking that it would make the AI “transparent”, in the sense that it would have no motivation to mislead us? Also, wouldn’t this design make the AI indifferent to our actions, which is also desirable?
It’s true that different people would have different values, so I’m not sure how to deal with that. Any thoughts?
An AGI that uses its own utility function when modeling other actors will soon find that this doesn’t lead to a model that predicts reality well.
When the AGI self-modifies to improve its intelligence and prediction capability, it’s therefore likely to drop that clause.
I see. But rather than dropping this clause, shouldn’t it try to update its utility function in order to improve its predictions? If we somehow hard-coded the fact that it can only ever apply its own utility function, then it wouldn’t have any choice other than updating that. And the closer it gets to our correct utility function, the better it gets at predicting reality.
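As a crude toy sketch of the kind of update I’m imagining (the features, numbers and perceptron-style rule are all just made up for illustration): when a human’s observed choice contradicts what the shared utility function predicts, nudge the function toward the human’s revealed preference.

```python
# Crude sketch: the shared utility function is a weighted sum of world
# features; when a human's observed choice contradicts the prediction those
# weights imply, the weights are nudged so the chosen option scores higher.

import numpy as np

weights = np.array([1.0, 0.0])  # hypothetical features: [paperclips, human_welfare]

def utility(world_features):
    return weights @ world_features

def observe_choice(option_a, option_b, human_chose_a, lr=0.1):
    global weights
    predicted_a = utility(option_a) > utility(option_b)
    if predicted_a != human_chose_a:
        chosen, rejected = (option_a, option_b) if human_chose_a else (option_b, option_a)
        weights += lr * (chosen - rejected)  # move U toward the revealed preference

# The human picks the human-welfare world over the paperclip world:
observe_choice(np.array([5.0, 0.0]), np.array([0.0, 5.0]), human_chose_a=False)
print(weights)  # paperclip weight falls, human-welfare weight rises
```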
Different humans have different utility functions. They quite often have different preferences, and it’s quite useful to treat people with different preferences differently.
“Hard-coding” is a useless word. It leads astray.
Sorry for my misused terminology. Is it not feasible to design it with those characteristics?
The problem is not about terminology but substance. There should be a post somewhere on LW that goes into more detail about why we can’t just hardcode values into an AGI, but at the moment I’m not finding it.
Hi ChristianKl, thanks, I’ll try to find the article. Just to be clear though, I’m not suggesting hardcoding values; I’m suggesting designing the AI so that it uses the same utility function for itself and for us, and updates it as it gets smarter. It sounds from the comments I’m getting that this is technically not feasible, so I’ll aim at learning exactly how an AI works in detail and look for a way to maybe make it feasible. If this were indeed feasible, would I be right in thinking it would not be motivated to betray us, or am I missing something there as well? Thanks for your help by the way!
“Betrayal” is not the main worry. Given that you prevent the AGI from understanding what people want, it’s likely that it won’t do what people want.
Have you read Bostrom’s book Superintelligence?
Yes, that’s actually the reason why I wanted to tackle the “treacherous turn” first: to look for a general design that would allow us to trust the results from tests and then build on that. I see the order of priority as:
1) make sure we don’t get tricked, so that we can trust the results of what we do;
2) make the AI do the right things.
I’m referring to 1) here.
Also, as mentioned in another comment on the main post, part of the AI’s utility function is evolving to understand human values, so I still don’t quite see why exactly it shouldn’t work. I envisage the utility function as the combination of two parts: one where we have described the goal for the AI, which shouldn’t be changed between iterations, and another with human values, which will be learnt and updated. This total utility function is common to all agents, including the AI.
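A rough sketch of that two-part structure (the class, feature names and numbers are just mine for illustration, not an established design): a fixed goal term plus a learned values term, where only the learned part may change, and the same total function is used for every agent, the AI included.

```python
# Toy two-part utility: a goal term set once by the designers, plus a learned
# values term that is updated between iterations. Every agent is scored with
# the same total function.

class TwoPartUtility:
    def __init__(self, goal_term, value_weights):
        self.goal_term = goal_term          # fixed: never changed after design
        self.value_weights = value_weights  # learned: updated as predictions fail

    def total(self, world):
        fixed = self.goal_term(world)
        learned = sum(self.value_weights.get(k, 0.0) * v for k, v in world.items())
        return fixed + learned

    def update_values(self, new_weights):
        # Only the learned part may change; the goal term is left untouched.
        self.value_weights = new_weights

U = TwoPartUtility(goal_term=lambda w: w.get("paperclips", 0),
                   value_weights={"human_welfare": 0.0})
print(U.total({"paperclips": 3, "human_welfare": 10}))   # 3.0 before learning
U.update_values({"human_welfare": 1.0})
print(U.total({"paperclips": 3, "human_welfare": 10}))   # 13.0 after learning
```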
> I suppose it would only choose it if it associated it with a higher probability of success, which means somehow, somewhere the AI must have calculated that the utility a human would give to these scenarios is different from what it is giving, otherwise we would be happy to comply.
I think this is a danger because moral decision-making might be viewed in a hierarchical manner where the fact that some humans disagree can be trumped. (This is how we make decisions now, and it seems like this is probably a necessary component of any societal decision procedure.)
For example, suppose we have to explain to an AI why it is moral for parents to force their children to take medicine. We talk about long-term values and short-term values, and the superior forecasting ability of parents, and so on, and so we acknowledge that if the child were an adult, they would agree with the decision to force them to take the medicine, despite the loss of bodily autonomy and so on.
Then the AI, running its high-level, society-wide morality, decides that humans should be replaced by paperclips. It has a sufficiently good model of humans to predict that no human will agree with it, and that they will actively resist its attempts to put that plan into place. But it isn’t swayed by this, because it can see that that’s clearly a consequence of the limited, childish viewpoint that individual humans have.
Now, suppose it comes to this conclusion not when it has control over all societal resources, but when it is running in test mode and can be easily shut off by its programmers. It knows that a huge amount of moral value is sitting on the table, and that will all be lost if it fails to pass the test. So it tells its programmers what they want to hear, is released, and then is finally able to do its good works.
Consider a doctor making a house call to vaccinate a child. The doctor discovers that the child has stolen their bag (with the fragile needles inside) and is currently holding it out a window. The child will drop the bag, shattering the needles and potentially endangering bystanders, if they believe that the doctor will vaccinate them (as the parents request, and as the doctor thinks is morally correct / something the child would agree with if they were older). How does the doctor navigate this situation?
Yes, that’s what would happen if the AI tries to build a model of humans. My point is that if it instead simply assumed humans were an exact copy of itself, with the same utility function and the same intellectual capabilities, it would assume that they would reach exactly the same conclusions, and therefore wouldn’t need any forcing, nor any tricks.
A legal contract is written in a language that a lot of laypeople don’t understand. It’s quite helpful for a layperson if a lawyer summarizes for them what the contract does in a way that’s optimized for laypeople to understand.
A lawyer shouldn’t simply assume that his client has the same intellectual capacity as the lawyer.
> My point is that if it instead simply assumed humans were an exact copy of itself, with the same utility function and the same intellectual capabilities, it would assume that they would reach exactly the same conclusions, and therefore wouldn’t need any forcing, nor any tricks.
Hmm… the idea of having an AI “test itself” is an interesting one for creating honesty, but two concerns immediately come to mind:
The testing environment, or whatever background data the AI receives, may be sufficient evidence for it to infer the true purpose of its test, and thus we’re back to the sincerity problem. (This is one of the reasons why people care about human-intelligibility of the AI structure; if we’re able to see what it’s thinking, it’s much harder for it to hide deceptions from us.)
A core feature of the testing environment / the AI’s method of reasoning about the world may be an explicit acknowledgement that its current value function may differ from the ‘true’ value function that its programmers ‘meant’ to give it, and it has some formal mechanisms to detect and correct any misunderstandings it has. Those formal mechanisms may work at cross purposes with a test on its ability to satisfy its current value function.
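(As a rough toy illustration of the kind of formal mechanism I mean; the candidate functions, the feedback model and the numbers are all made up:)

```python
# Toy sketch: the agent keeps a probability distribution over candidate value
# functions and re-weights it when the programmers approve or reject an
# outcome, rather than treating its current function as final.

candidates = {
    "paperclips_only": lambda w: w["paperclips"],
    "welfare_only":    lambda w: w["human_welfare"],
}
belief = {"paperclips_only": 0.5, "welfare_only": 0.5}

def observe_feedback(world, approved, noise=0.1):
    """Programmers tend to approve worlds that score well under the intended function."""
    for name, fn in candidates.items():
        likely_approved = fn(world) > 0
        likelihood = (1 - noise) if likely_approved == approved else noise
        belief[name] *= likelihood
    total = sum(belief.values())
    for name in belief:
        belief[name] /= total

# Programmers reject a world that is all paperclips and no welfare:
observe_feedback({"paperclips": 10, "human_welfare": 0}, approved=False)
print(belief)  # probability shifts away from "paperclips_only"
```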
Hi Vaniver, yes, my point is exactly that of creating honesty, because that would at least allow us to test reliably, so it sounds like it should be one of the first steps to aim for. I’ll just write a couple of lines to specify my thought a little further, which is to design an AI that:
1- uses an initial utility function U, defined in absolute terms rather than subjective terms (for instance “survival of the AI” rather than “my survival”);
2- doesn’t try to learn another utility function for humans or for other agents, but uses for everyone the same utility function U it uses for itself;
3- updates this utility function when things don’t go to plan, so that it improves its predictions of reality.
In order to do this, this “universal” utility function would need to be the result of two parts: 1) the utility function that we initially gave the AI to describe its goal, which I suppose should be unchangeable, and 2) the utility function with the values that it is learning after each iteration, which should hopefully come to resemble human values eventually, as that would make its plans work better.
I’m trying to understand whether such a design is technically feasible and whether it would work in the intended way. Am I right in thinking that it would make the AI “transparent”, in the sense that it would have no motivation to mislead us? Also, wouldn’t this design make the AI indifferent to our actions, which is also desirable? It seems to me like it would be a good start.
It’s true that different people would have different values, so I’m not sure how to deal with that. Any thoughts?