This may be a naive question, which has a simple answer, but I haven’t seen it. Please enlighten me.
I’m not clear on why an AI should have a utility function at all.
The computer I’m typing this on doesn’t. It simply has input-output behavior. When I hit certain keys it reacts in certain, very complex ways, but it doesn’t decide. It optimizes, but only when I specifically tell it to do so, and only on the parameters that I give it.
We tend to think of world-shaping GAI as an agent with its own goals, which it seeks to implement. Why can’t it be more like a computing machine in a box? We could feed it questions, like “given this data, will it rain tomorrow?”, or “solve this protein folding problem”, or “which policy will best reduce gun violence?”, or even “given these specific parameters and definitions, how do we optimize for human happiness?”

For complex answers like the last of those, we could then ask the AI to model the state of the world that results from following the proposed policy. If we see that it leads to tiling the universe with smiley faces, we know that we made a mistake somewhere (that wasn’t what we were trying to optimize for), and adjust the parameters. We might even train the AI over time, so that it learns how to interpret what we mean from what we say. When the AI models a state of the world that actually reflects our desires, we implement its suggestions ourselves, or perhaps only then hit the implement button, by which the AI takes the steps to carry out its plan. We might even use such a system to check the safety of future generations of the AI. This would slow recursive self-improvement, but it seems it would be much safer.
This has been proposed before, and on LW is usually referred to as “Oracle AI”. There’s an entry for it on the LessWrong wiki, including some interesting links to various discussions of the idea. Eliezer has addressed it as well.
See also Tool AI, from the discussions between Holden Karnofsky and LW.
A process can progress towards some end state even without having any representation of that state. Imagine a program that starts with a positive number and at each step replaces the current number “x” with the value “x/2 + 1/x”. Regardless of the starting number, the values will gradually converge to a constant (the square root of 2). Can we say that this process has a “goal” of reaching that number? It feels wrong to use the word here, because the constant appears nowhere in the process; convergence just happens.
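A minimal sketch of this process makes the point concrete (the starting points and iteration count here are arbitrary choices for illustration): the constant √2 appears nowhere in the code, yet every run converges to it.

```python
import math

def step(x):
    # One update of the process: replace x with x/2 + 1/x.
    return x / 2 + 1 / x

# Run the process from two very different positive starting points.
for x0 in (0.001, 1000.0):
    x = x0
    for _ in range(60):
        x = step(x)
    print(f"{x0} -> {x}")  # both runs approach sqrt(2) ≈ 1.41421356...
```

This particular update happens to be Newton’s method for x² = 2, which is why it converges so quickly; but nothing in `step` mentions the destination.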
Typically, when we speak about having a “goal” X, we mean that somewhere (e.g. in a human brain, or in a company’s mission statement) there is a representation of X; reality is then compared with X, various paths from here to X are evaluated, and one of those paths is followed.
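For contrast, here is a toy sketch of a goal in this sense (the target value and the set of candidate moves are invented for illustration): the goal X is stored explicitly, reality is compared against it, and candidate paths are evaluated before one is followed.

```python
GOAL = 2 ** 0.5  # an explicit internal representation of the desired end state

def choose_move(x, moves=(-0.1, -0.01, 0.01, 0.1)):
    # Compare where each candidate move would leave us relative to the goal,
    # and follow the path that ends closest to it.
    return min(moves, key=lambda m: abs((x + m) - GOAL))

x = 5.0  # arbitrary starting state
for _ in range(200):
    x += choose_move(x)
print(abs(x - GOAL) < 0.1)  # the controller has steered x near its represented goal
```

Here the destination is a datum inside the program: delete the `GOAL` line and the behavior disappears entirely, which is exactly the difference from a process that merely drifts toward a fixed point.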
I am saying this to make more obvious that there is a difference between “having a representation of X” and “progressing towards X”. Humans typically create representations of their desired end states, and then try to find a way to achieve them. Your computer doesn’t have this, and neither does a “Tool AI” at the beginning. Whether it can create representations later depends on technical details of how such a “Tool AI” is programmed.
Maybe there is a way to allow superhuman thinking even without creating representations corresponding to things normally perceived in our world. (For example, AIXI.) But even in such a case, there is a risk of a pseudo-goal of the “x/2 + 1/x” kind, where the process progresses towards an outcome without having a representation of it. An AI can “escape from the box” even without having representations of “box” and “escape”, if a way to escape exists.
I don’t get this explanation. Sure, a process can tend toward a certain result, without having an explicit representation of that result. But such tendencies often seem to be fragile. For example, a car engine homeostatically tends toward a certain idle speed. But take out one or all spark plugs, and the previously stable performance evaporates. Goals-as-we-know-them, by contrast, tend to be very robust. When a human being loses a leg, they will obtain a synthetic one, or use a wheelchair. That kind of robustness is part of what makes a very powerful agent scary, because it is intimately related to the agent’s seeing many things as potential resources to use toward its ends.
First, there’s the political problem: if you can build agent AI and just choose not to, this doesn’t help very much when someone else builds their UFAI (which they want to do, because agent AI is very powerful and therefore very useful). So you have to get everyone on board with the plan first. Also, having your superintelligent oracle makes it much easier for someone else to build an agent: just ask the oracle how. If you don’t solve Friendliness, you have to solve the incentives instead, and “solve politics” doesn’t look much easier than “solve metaethics.”
Second, the distinction between agents and oracles gets fuzzy when the AI is much smarter than you. Suppose you ask the AI how to reduce gun violence: it spits out a bunch of complex policy changes, which are hard for you to predict the effects of. But you implement them, and it turns out that they result in drastically reduced willingness to have children. The population plummets, and gun violence deaths do too. “Okay, how do I reduce per capita gun violence?”, you ask. More complex policy changes; this time they result in increased pollution which disproportionately depopulates the demographics most likely to commit gun violence. “How do I reduce per capita gun violence without altering the size or demographic ratios of the population?” Its recommendations cause a worldwide collapse of the firearms manufacturing industry, and gun violence plummets, along with most metrics of human welfare.
If you have to blindly implement policies you can’t understand, you’re not really much better off than letting the AI implement them directly. There are some things you can do to mitigate this, but ultimately the AI is smarter than you. If you could fully understand all its ideas, you wouldn’t have needed to ask it.
Does this sound familiar? It’s the untrustworthy genie problem again. We need a trustworthy genie, one that will answer the questions we mean to ask, not just the questions we actually ask. So we need an oracle that understands and implements human values, which puts us right back at the original problem of Friendliness!
Non-agent AI might be a useful component of realistic safe AI development, just as “boxing” might be. Seatbelts are a good idea too, but they only matter once something has already gone wrong. Similarly, oracle AI might help, but it’s not a replacement for solving the actual problem.
This is actually one of the standard counterarguments against the need for friendly AI, or at least against the notion that it should be an agent / be capable of acting as an agent.
I’ll try to quickly summarize the counter-counterarguments Nick Bostrom gives in Superintelligence. (In the book, AI that is not an agent at all is called tool AI. AI that is an agent but cannot act as one, i.e. has no executive power in the real world, is called oracle AI.)
Some arguments have already been mentioned:
Tool AI, or friendly AI without executive power, cannot stop the world from building UFAI. Its ability to prevent this and other existential risks is greatly diminished. It especially cannot guard us against the “unknown unknowns” (an oracle is not going to give answers to questions we are not asking).
The decisions of an oracle or tool AI might look good, but actually be bad for us in ways we cannot recognize.
There is also the possibility of what Bostrom calls mind crime. If a tool or oracle AI is not inherently friendly, it might simulate sentient minds in order to answer the questions that we ask, and then kill or possibly even torture those minds. The probability that these simulations have moral status may be low, but there could be trillions of them, so even a low probability cannot be ignored.
Finally, it might be that the best strategy for a tool AI to produce an answer is to internally develop an agent-type AI that is capable of self-improvement. If the default outcome of creating a self-improving AI is doom, then the tool AI scenario might in fact be less safe.
I was just reading through the Eliezer article. I’m not sure I understand. Is he saying that my computer actually does have goals?
Isn’t there a difference between simple cause and effect and an optimization process that aims at some specific state?
Maybe it would help to “taboo” the word “goal”.
If you use a spell checking engine while you are typing, it likely has a utility function buried in its code.
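To make that concrete, here is a toy sketch of the kind of scoring buried in a spell checker (the vocabulary, frequencies, and weighting are all made up for illustration; real engines use far richer models): candidate corrections are ranked by a score combining similarity to the typed word with how common the candidate is, and the highest-scoring candidate wins.

```python
import difflib

VOCAB = {"utility": 120, "utterly": 80, "futility": 15}  # word -> assumed frequency

def score(word, candidate):
    # Higher is better: similarity to the typed word,
    # plus a small bonus for how common the candidate is.
    similarity = difflib.SequenceMatcher(None, word, candidate).ratio()
    return similarity + 0.001 * VOCAB[candidate]

def correct(word):
    # Pick the vocabulary entry that maximizes the score.
    return max(VOCAB, key=lambda c: score(word, c))

print(correct("utlity"))
```

In that narrow sense the engine really is maximizing something, even though nobody would call it an agent.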