I think that most easy-to-measure goals, if optimised hard enough, eventually end up with a universe tiled with molecular smiley faces. Consider a law enforcement AI. There is no sharp line between education programs, reducing lead pollution, and using nanotech to rewire human brains into perfectly law-abiding puppets. For most utility functions that aren't intrinsically conservative, there will be some state of the universe that scores really highly and looks nothing like the present.
In any “what failure looks like” scenario, at some point you end up with superintelligent stock traders that want to fill the universe with tiny molecular stock markets, competing with weather-predicting AIs that want to freeze the earth into a maximally predictable 0 K block of ice.
These AIs are wielding power that could easily wipe out humanity as a side effect. If they fight, humanity will get killed in the crossfire. If they work together, they will tile the universe with some strange mix of many different “molecular smiley faces”.
I don’t think that you can get an accurate human-values function by averaging together many poorly thought out, ad hoc functions that were designed to be contingent on specific details of how the world happened to be. (E.g. a function that assumes people broadcast TV signals, and says the stock market went up iff a particular pattern of electromagnetic waves encodes a picture of a graph going up next to the words “financial news”. Outside the narrow slice of possible worlds with broadcast TV, this AI just wants to grab a giant radio transmitter and transmit a particular stream of nonsense.)
I think that humans existing is a specific state of the world, something that only happens if an AI is optimising for it. (And an actually good definition of “human” is hard to specify.) Humans having lives we would consider good is even harder to specify. When there are substantially superhuman AIs running around, the value of our atoms exceeds any value we can offer in trade. The AIs could psychologically or nanotechnologically twist us into whatever shape they pleased. We can’t meaningfully threaten any of the AIs.
We won’t be left even a tiny fraction of the resources; we will be really bad at defending them compared to any AI, and any of the AIs could easily grab all of them. There will also be various AIs that care about humans in the wrong way: a cancer-curing AI that wants to wipe out humanity to stop us getting cancer, or a marketing AI that wants to fill all human brains with corporate slogans (think nanotech brain rewrite, to the point of drooling vegetable).
EDIT: All of the above is talking about the end state of a “get what you measure” failure. There could be a period, possibly decades, where humans are still around but things are going wrong in the ways described.
This was helpful to me, thanks. I agree this seems almost certain to be the end state if AI systems are optimizing hard for simple, measurable objectives.
I’m still confused about what happens if AI systems are optimizing moderately for more complicated, measurable objectives (which better capture what humans actually want). Do you think the argument you made implies that we still eventually end up with a universe tiled with molecular smiley faces in this scenario?
I think that this depends on how hard the AIs are optimising and how complicated the objectives are. Sufficiently moderate optimisation for goals sufficiently close to human values will probably turn out well.
I also think that optimisation is likely to end up at the physical limits, unless we know how to program an AI that doesn’t want to improve itself, and everyone makes AIs like that.
A sufficiently moderate AI is just dumb, which is safe. An AI smart enough to stop people producing more AIs, yet dumb enough to be safe, seems harder.
There is also a question of what “better capturing what humans want” means. A utility function that, when restricted to the space of worlds roughly similar to this one, produces utilities close to the true human utility function seems easy enough. Suppose we have defined something close to human well-being, in terms of the levels of various neurotransmitters near human DNA. Let’s suppose this definition would have been highly accurate over all of history, and would make the right call on nearly all current political issues. It could still fail completely in a future containing uploaded minds and neurochemical vats.
Either your approximate utility function needs to be pretty close on all possible futures (even adversarially chosen ones), or you need to know that the AI won’t guide the future towards places where the utility functions differ.
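To make that last point concrete, here is a minimal toy sketch (not from the comment above; the 2-D “world states”, both utility functions, and all the numbers are invented for illustration). A proxy fitted to ordinary worlds can agree closely with the true utility on those worlds, while a hard optimiser searching the whole space finds a proxy-maximum that the true utility rates as disastrous:

```python
# Toy illustration: a proxy utility that matches the "true" utility on
# ordinary worlds but diverges badly under hard optimisation.
# Everything here (the 2-D world states, both functions, the numbers) is
# made up for the sketch.
import numpy as np

rng = np.random.default_rng(0)

def true_utility(x):
    # Hypothetical "true" values: likes x[0], strongly dislikes extreme x[1].
    return x[0] - 0.5 * x[1] ** 2

def proxy_utility(x):
    # Proxy fitted on ordinary worlds, where x[1] is always small, so the
    # x[1] penalty never showed up in the fit.
    return x[0] + 0.1 * x[1]

# On worlds "roughly similar to this one" (small coordinates), the two
# functions stay close.
normal_worlds = rng.normal(0, 1, size=(1000, 2)).T
print("mean gap on normal worlds:",
      np.abs(true_utility(normal_worlds) - proxy_utility(normal_worlds)).mean())

# A hard optimiser searches far outside that distribution for the proxy's
# maximum, and the point it finds is terrible by the true utility.
candidates = rng.uniform(-100, 100, size=(100_000, 2)).T
best = candidates[:, np.argmax(proxy_utility(candidates))]
print("proxy value at proxy-optimum:", proxy_utility(best))
print("true value at proxy-optimum:", true_utility(best))
```

The sketch is just the disjunction above in miniature: either the proxy stays close to the true utility everywhere the optimiser can reach, or the optimiser has to be kept away from the regions where the two functions come apart.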