The "Plan B" approach in AI Safety
Central claim: ambitious alignment of superintelligent AI will not be solved in time; therefore, different variants of a plan B should be explored and implemented.
Below is a short overview of, and links to, my works on AI safety. The central theme of all of them is "plan B", so the collection could be called the "plan B agenda".
Ambitious alignment of an arbitrary superintelligent AI can't be solved before AGI is created: a) any proof of alignment would be too complex and unreliable; b) it would require at least a superintelligent AI safety team; c) dangerous AI may appear soon, perhaps by 2030. Many different types of AI-related x-risks are possible, not only a paperclipper.
The best available solution is limiting the power of AI: artificial stupidity, a limited number of use cases, internal surveillance, multilevel boxing.
A safer AI architecture is needed: data-driven AI with a human-like functional architecture for interpretability. A human-like AI could be modelled on a real-world person: it is not an upload, but something very similar to a side-load, e.g. trained on that person's texts.
Not only a local but a global solution to AI safety is needed: an AI Nanny based on global surveillance and empowered by an agreement between the nuclear powers. A human-like AI or the first upload could serve as its "virtual tsar".
The nature of human values depends on who the "moral subject" is: society, the brain, the personality, or a subpersonality. Human values are not stable preferences. "Real choices" are more important than values, as "values" are a philosophical construction that is not always useful. Meta-values, such as life extension and personal universes, matter more, as they give everyone the freedom to have their own values.
An AI that is still learning human values is dangerous, as it doesn't know those values yet.
An unfriendly AI will not be very interested in killing humans for their atoms, as those atoms have very small instrumental value, while living humans have larger instrumental value at all stages of the AI's evolution. But we have to help future AI understand this, especially a "Young AI". We could strengthen our position by making some precommitments.
My works on AI Safety:
AI safety defense levels, a roadmap
Catching Treacherous Turn: A Model of the Multilevel AI Boxing
Catastrophically Dangerous AI is Possible Before 2030
AI Alignment Problem: “Human Values” don’t Actually Exist
First human upload as AI Nanny
Artificial Intelligence in Life Extension: from Deep Learning to Superintelligence
Military AI as a Convergent Goal of Self-Improving AI
Classification of Global Catastrophic Risks Connected with Artificial Intelligence
The Global Catastrophic Risks Connected with Possibility of Finding Alien AI During SETI
Classification of the Global Solutions of the AI Safety Problem.
Levels of Self-Improvement in AI and their Implications for AI Safety
“Possible Dangers of the Unrestricted Value Learners” – LW post
Humans could be a threat though, i.e. by building another AI with different values.
There is only a very short period of time when humans are a threat and thus might need to be exterminated: after the AI is already capable of causing human extinction, but before it reaches the level of superintelligent omnipotence.
A superintelligent AI could prevent the creation of other AIs through surveillance, e.g. via nanotech. So once an AI has mastered nanotech, it doesn't need to exterminate humans for its own safety; only an AI that has not yet reached the nanotech stage might need to. But how would it do so? It could create a biological virus, which is simpler than nanotech, but such a Young AI still depends on human-built infrastructure, like electricity, so exterminating humans before reaching nanotech is not a good idea for it.
I am not trying to argue that AI is innately safe here; I just want to point out that the extermination of humans is not a convergent goal for AI. There are still many ways an AI could go wrong and kill us all.
At the nanotech stage, the AI can turn any atoms into really good robots. Self-replication ⇒ exponential growth ⇒ the limiting factor quickly becomes atoms and energy. If the AI is just doing self-replication and paperclip production, humans aren't useful workers compared to nanotech robots. (Also, the AI will probably disassemble the Earth. At this stage, it has to build O'Neill cylinders, nanotech food production, etc. to avoid wiping out humanity.)
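To make the "self-replication ⇒ exponential growth ⇒ atoms become the limiting factor" step concrete, here is a toy back-of-envelope in Python. The replicator size, doubling time, and seed count are invented illustrative assumptions, not claims about real nanotech:

```python
# Toy back-of-envelope: how fast self-replicating nanotech would exhaust Earth's atoms.
# All parameters are assumptions for illustration only.
import math

ATOMS_ON_EARTH = 1.33e50      # rough order-of-magnitude estimate
SEED_REPLICATORS = 1          # assumed: start from a single replicator
ATOMS_PER_REPLICATOR = 1e9    # assumed size of one nanobot, in atoms
DOUBLING_TIME_HOURS = 1.0     # assumed replication (doubling) time

# Doublings until the replicator population has incorporated every atom on Earth.
doublings = math.log2(ATOMS_ON_EARTH / (SEED_REPLICATORS * ATOMS_PER_REPLICATOR))
print(f"doublings needed: {doublings:.0f}")                              # ~137
print(f"time needed: {doublings * DOUBLING_TIME_HOURS / 24:.1f} days")   # ~5.7 days
```

Because the growth is exponential, changing the assumed parameters by several orders of magnitude only shifts the answer by a handful of doublings.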
I think there is a common fallacy here: superintelligent AI risks are perceived as if they were grey goo risks.
The main difference is that an AI thinks strategically over very long time horizons and takes even small probabilities into account.
If an AI is going to create as many paperclips as possible, then what it cares about is its chance of colonising the whole universe and even of surviving the end of the universe. These chances are negligibly affected by the number of atoms on Earth, but depend strongly on the AI's chances of eventually meeting alien AIs. Those aliens may have different value systems, and some of them will be friendly to their creators. Such AIs will not be happy to learn that a Paperclipper destroyed humanity and will not agree to help it make more paperclips. Bostrom explored similar ideas in "Hail Mary and Value Porosity".
TL;DR: it is instrumentally reasonable to preserve humans, as they could be traded with alien AIs, while human atoms have very small instrumental value.
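As a rough sanity check on this asymmetry, here is a back-of-envelope sketch in Python. The number of reachable galaxies, the probability of useful alien trade, and the resources at stake are invented placeholders:

```python
# Rough illustration (with assumed numbers) of why Earth's atoms are a rounding
# error for a universe-colonising paperclipper, while alien trade might not be.
EARTH_MASS_KG = 5.97e24
GALAXY_STELLAR_MASS_KG = 1e11 * 2e30   # ~10^11 stars at roughly one solar mass each
REACHABLE_GALAXIES = 1e9               # assumed number of reachable galaxies

reachable_matter_kg = GALAXY_STELLAR_MASS_KG * REACHABLE_GALAXIES
earth_fraction = EARTH_MASS_KG / reachable_matter_kg
print(f"Earth's share of reachable matter: {earth_fraction:.0e}")   # ~3e-26

P_ALIEN_TRADE = 1e-3    # assumed probability of meeting an alien AI that cares
STAKE_FRACTION = 1e-2   # assumed fraction of total resources affected by such trade
print(earth_fraction < P_ALIEN_TRADE * STAKE_FRACTION)              # True, by ~20 orders of magnitude
```

The specific numbers are arbitrary; the point is that Earth's share of reachable matter is so many orders of magnitude below any plausible trade-related stake that the comparison survives very large changes to the assumptions.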
FYI, there are a lot of links that don't work here: "multilevel boxing," "AI-nanny," "Human values," and so on.
Thanks, it looks like they died during copy-pasting.
I agree, the “useful atoms” scenario is not the only possible one. Some alternatives:
convert all matter in our light cone into happy faces
convert the Earth into paperclips, and then research how to prevent the heat death of the Universe, to preserve the precious paperclips
confine humans in a simulation, to keep them as pets / GPU units / novelty creators
make it impossible for humans to ever create another AGI; then leave the Earth
kill everyone except Roko
kill everyone outside China
become too advanced for any interest in such clumsy things as atoms
avoid any destruction, and convert itself into a friendly AI, because it’s a rational thing to do.
The point is, the unfriendly AI will have many interesting options for how to deal with us. And not every option will make us extinct.
It is hard to predict the scale of destruction, as it is hard for this barely intelligent ape to predict the behavior of a recursively self-improving Bayesian superintelligence. But I guess that the scale of destruction depends on:
the AI’s utility function
the AI’s ability to modify its utility function
the risk of humans creating another AI
the risk of the AI still being in a simulation where the creators evaluate its behavior
whether the AI is reading LessWrong and taking notes
various unknowns
So, there might be a chance that the scale of destruction will be small enough for our civilization to recover.
How can we utilize this chance?
1. Space colonization
There is some (small?) chance that the destruction will be limited to the Earth.
So, colonizing Mars / the Moon / asteroids is an option.
But it’s unclear how much of our resources should be allocated for that.
In an ideal world, alignment research would get orders of magnitude more money than space colonization. But in that same ideal world, the money allocated for space colonization could be in the trillions of USD.
2. Mind uploading
With mind uploading, we could transmit our minds into outer space, in the hope that some day the data will be received by someone out there. No AGI can stop it, as the data will propagate at the speed of light.
3. METI
If we are really confident that the AGI will kill us all, why not call for help?
We can't become extinct twice. So, if we are already doomed, we might as well do METI.
If an advanced interstellar alien civilization comes to kill us, the result will be the same: extinction.
But if it comes to rescue, it might help us with AI alignment.
4. Serve the Machine God
(this point might not be entirely serious)
In deciding your fate, the AGI might consider:
- if you’re more useful than the raw materials you’re made of
- if you pose any risk to its existence
So, if you are a loyal minion of the AGI, you are much more likely to survive.
You know, only metal endures.
This chance is basically negligible, unless you made Earth a special case in the AI's code. But then you could make one room a special case by changing a few lines of code.
Probably no aliens anywhere near. (Fermi paradox) Human minds = Lots of data = Hard to transmit far.
The AI can chase after the signals, so we get a few years running on alien computers before the AI destroys those, compared to a few years on our own computers (a small sketch below makes the timing concrete).
Runs risk of evil aliens torturing humans. Chance FTL is possible.
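The timing in the "chase after the signals" point can be sketched with a few assumed numbers; the probe speed, launch delay, and distances are all illustrative:

```python
# Toy calculation of the head start uploaded minds get on alien hardware before
# the AI's pursuing probes arrive. All parameters are assumptions.
def head_start_years(distance_ly: float, probe_speed_c: float, launch_delay_y: float) -> float:
    """Years between the signal's arrival and the pursuing probes' arrival."""
    signal_arrival = distance_ly                              # a light-speed signal takes D years per D light-years
    probe_arrival = launch_delay_y + distance_ly / probe_speed_c
    return probe_arrival - signal_arrival

for d in (10, 100, 1000):                                     # assumed receiver distances in light-years
    print(d, round(head_start_years(d, probe_speed_c=0.9, launch_delay_y=1.0), 1))
# 10 ly -> ~2.1 years; 100 ly -> ~12.1 years; 1000 ly -> ~112.1 years
```

For nearby receivers the head start is only a couple of years; it grows linearly with distance as long as the pursuing probes are slower than light.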
If advanced aliens care, they could know all about us without our radio signals. We can’t hide from them. They will ignore us for whatever reason they are currently ignoring us.
With nanotech, the AI can trivially shape those raw materials into a robot that doesn't need food or sleep. A machine it can communicate with via fast radio, not slow sound waves. A machine that always does exactly what the AI wants it to. A machine that is smarter, more reliable, stronger, more efficient, and more suited to the AI's goal. You can't compete with AI-designed nanotech.
Some of the risks are "instrumental risks", like the use of human atoms, and others are "final goal risks", like covering the universe with smiley faces. If the final goal is something like smiley faces, the AI can still preserve some humans for instrumental reasons, like researching the types of smiles or trading with aliens.
If some humans are preserved instrumentally, they could live better lives than we do now and even be more numerous, so this is not an extinction risk. Most humans alive today are instrumental to states and corporations, but still get some reward.