Most Functions Have Undesirable Global Extrema
(Crossposted from my blog Centerless Set)
It’s hard to come up with a workably short function that includes human civilization in its global maximum/minimum.
This is a problem because we specify our goals to AI using functions. For any practical function we might use, there’s a set of strange and undesirable worlds that satisfy that function better than our world.
For example, if our function measures the probability that some particular glass is filled with water, the space near the maximum is full of worlds like “take over the galaxy and find the location least likely to be affected by astronomical phenomena, then build a megastructure around the glass designed to keep it full of water”.
On top of this, what the AI would actually do if we trained it on that function is hard to predict, because it would learn whatever values happened to perform well on the training data, and those values won’t necessarily carry over to new environments. The result would be strange, unintuitive values whose maximum is just as unlikely, if not more so, to include human civilization.
Here I’ll list some potential ideas that might come to mind, along with why I think they’re insufficient:
Idea:
Don’t specify our goals to AI using functions.
Flaw:
Current deep learning methods use functions to measure error, and AI learns by minimizing that error in an environment of training data. This has replaced the old paradigm of symbolic AI, which didn’t work very well. If progress continues in this direction, the first powerful AI will operate on the principles of deep learning.
Even if we build AI that doesn’t maximize a function, it won’t be competitive with AI that does, assuming present trends hold. Building weaker, safer AI doesn’t stop others from building stronger, less safe AI.
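To make this concrete, here is a minimal sketch of what “specifying a goal with a function” looks like under the current paradigm. The model, data, and numbers are invented for illustration; real systems are vastly larger, but the basic loop has the same shape: the goal exists only as a function being pushed toward its minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy training data
y = X @ np.array([2.0, -1.0, 0.5])       # targets the model should reproduce

w = np.zeros(3)                          # model parameters
for _ in range(500):
    error = X @ w - y                    # how wrong the model currently is
    loss = np.mean(error ** 2)           # the goal, expressed as a function
    grad = 2 * X.T @ error / len(y)      # direction that reduces that function
    w -= 0.1 * grad                      # "learning" is just following it downhill

print(w)  # ends up close to [2.0, -1.0, 0.5]
```

Whatever sits at the bottom of that function is what the process moves toward, whether or not we like what lives there.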
Idea:
Use long, complicated functions that represent our actual goals.
Flaw:
This is even more difficult than it sounds. It’s hard to specify a goal like “don’t affect anything except this glass and this pitcher of water” using a function. Every action that fills the glass with water also affects everything within the local causal sphere of influence, if only in very minor ways. “Only affect everything else a little” has very strange worlds near its global maximum. “Only use T seconds to form a plan to fill the glass with water” could result in a plan like “copy-paste my code without the planning time limit”.
Also, the longer our function gets, the harder it is to verify that the space near the global maximum is desirable. A complicated function only needs to be wrong in one way for the maximum to be undesirable.
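As a toy illustration of how fragile this is (the candidate plans and every number below are invented), suppose we score plans as “probability the glass gets filled” minus a weighted impact penalty. Unless the penalty weight happens to land in the right range, the highest-impact plan still wins:

```python
candidate_plans = {
    "pour from the pitcher":      {"p_filled": 0.95,     "impact": 1.0},
    "build a dispenser robot":    {"p_filled": 0.99,     "impact": 50.0},
    "galaxy-scale megastructure": {"p_filled": 0.999999, "impact": 1e12},
}

def score(plan, penalty_weight):
    # "fill the glass, but only affect everything else a little"
    return plan["p_filled"] - penalty_weight * plan["impact"]

for w in (0.0, 1e-15, 1e-3):
    best = max(candidate_plans, key=lambda name: score(candidate_plans[name], w))
    print(f"penalty weight {w:g}: optimizer picks '{best}'")
```

With no penalty, or a penalty that is too small, the optimizer still picks the megastructure, and we have no principled way of knowing the right weight in advance.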
Idea:
Use deep learning methods to generate the function.
Flaw:
There’s no reliable feedback mechanism to test the function outside of its training data. The function we get might be successful in the training environment, but then produce weird results in the real world.
This is okay for tools that don’t have to be 100% error-free, but for a powerful optimizer, a small discrepancy in the goal function can result in strange and undesirable maxima.
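Here is a toy sketch of that failure (invented data; a high-degree polynomial fit stands in for whatever black-box function the learning process produces). The fitted function matches our preferences on the situations it was trained on, yet can assign wildly higher scores to situations outside that range:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=30)                       # situations seen in training
prefs = -(x_train - 0.5) ** 2 + rng.normal(0, 0.01, size=30)   # noisy labels: we like moderate outcomes
learned = np.polynomial.Polynomial.fit(x_train, prefs, deg=9)  # stand-in for a learned goal function

inside = np.linspace(0.0, 1.0, 200)       # familiar situations
outside = np.linspace(-10.0, 10.0, 2000)  # situations never seen in training

print(learned(inside).max())   # roughly 0: agrees with our preferences
print(learned(outside).max())  # typically enormous: the learned goal "prefers" something strange
```

A tool that only ever encounters familiar situations never notices this; a powerful optimizer searching for the highest-scoring situation heads straight for it.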
Idea:
Involve humans heavily in the process.
Flaw:
Once the AI is powerful enough, any preferences it has will be manifested in the world. Human involvement only keeps it safe as long as it’s weak enough to be controlled.
For a powerful AI, even a function like “listen to what humans tell you” results in strange and undesirable global maxima. We would need an extremely reliable value learning function for this process to be safe. Teaching an AI to learn our values successfully might be even harder than teaching an AI our values directly.
Idea:
Get multiple AIs to prevent each other from maximizing their goal functions.
Flaw:
The global maximum of any set of functions like this still doesn’t include human civilization. Either a single AI will win, or some subset will compete among themselves with just as little regard for preserving humanity as the single AI would have.
Idea:
Don’t build powerful AI.
Flaw:
Technological progress is a highly decentralized process which is largely outside the control of any individual or group. Multiple groups are trying to develop powerful AI simultaneously, and more are likely to join in the future. Once it exists, powerful AI is likely to be much easier to generate or copy than historical examples of dangerous technologies like nuclear weapons.
Also, actors with higher risk tolerance or lower risk estimates are incentivized to develop powerful AI first. This makes the strategy of abstaining from AI development an unappealing option for most companies and governments.
Idea:
Build one AI just strong enough to stop others from being built.
Flaw:
The goal “stop other AI from being built” is not likely to be safer than the goal “fill the glass with water”. Both have undesirable worlds near the global maximum.
The key element of this idea might be “just strong enough”, but it’s unclear if this is feasible. It seems contradictory for an AI to be powerful enough to stop other AI from being built, but not powerful enough to produce undesirable worlds.
It’s conceivable that there exists some limited set of capabilities that meets these criteria, but none come to mind. It’s also conceivable that none exist.
Idea:
Enhance human cognitive abilities before the arrival of powerful AI.
Flaw:
We don’t know how to do this. So far, attempts to produce even minor cognitive improvements have been unsuccessful.
AI technology is improving much faster than human cognitive enhancement technology.
...
I don’t know of any solutions that aren’t flawed in this way.