Another formalization attempt: Central Argument That AGI Presents a Global Catastrophic Risk
Chalmers recently requested a formalized argument for why AI poses a catastrophic risk. Here is my attempt to write one (not polished).
Short version
1. Future AI will be very powerful soon.
2. Future AI systems will likely not be exactly aligned with human values.
3. Harm from AI is proportional to (capabilities) × (misalignment).
4. Based on 1, 2 and 3 we can conclude that AI will cause catastrophic harm soon.
5. Catastrophic harm is an existential risk to humanity: it certainly includes disempowerment, and likely includes extinction or eternal suffering.
6. There are many ways in which AI can cause catastrophic harm.
More detailed version
1) Future AI will be very powerful soon because of Moore’s law, the stream of new ideas, and global and local self-improvement. This will happen relatively soon because of the exponential nature of the global self-improvement process, and because the global AI arms race will favour such dynamics. (A toy sketch of this growth dynamic follows the assumptions below.)
Assumptions:
a) Powerful AI is possible.
b) Powerful AI will appear relatively soon, before AI control theory and practice are developed.
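A minimal sketch of the growth dynamics assumed in (1). The variables C (capability), C_0, k, and the specific functional forms are illustrative assumptions of mine, not claims from the argument itself:

```latex
% Toy growth models for AI capability C(t) (illustrative only).
% Ordinary technological progress: capability grows in proportion to itself.
% Recursive self-improvement: the growth rate itself scales with capability,
% which gives hyperbolic growth that diverges in finite time.
\begin{align}
  \frac{dC}{dt} = k\,C \quad &\Rightarrow \quad C(t) = C_0\, e^{k t}
    \quad \text{(exponential growth)} \\
  \frac{dC}{dt} = k\,C^{2} \quad &\Rightarrow \quad C(t) = \frac{C_0}{1 - k\,C_0\, t}
    \quad \text{(hyperbolic growth, diverging at } t = 1/(k\,C_0)\text{)}
\end{align}
```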
2) Future AI systems will likely not be exactly aligned with human values, as we don’t know what human values are, or how to instil any values at all in AI, and there are several other reasons (whose values we are installing, internal misalignment).
Lemma: AI Alignment is difficult
(Needs to be proved separately, see below)
3) Harm from AI is proportional to (capabilities) × (misalignment).
The relation is not necessarily linear, as capability growth can itself cause misalignment (ontological crisis, sharp left turn).
It is also assumed here that AI will be used for actions: if an AI is not acting, its capabilities are not dangerous by themselves.
4) Based on 1, 2 and 3 we can conclude that AI will cause catastrophic harm soon.
In the equation
Harm = (capabilities) × (misalignment)
the capability factor will grow very large based on (1), and the misalignment factor will be non-zero based on (2). As AI’s capabilities can grow arbitrarily large, harm can reach the maximum possible level.
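The step from (1), (2), and (3) to (4) can be written out explicitly. The symbols C, m, and ε below are my notation, not part of the original argument:

```latex
% Notation (mine): C(t) = AI capability, m(t) = its misalignment with human values.
% Premise (3): Harm(t) = C(t) * m(t)
% Premise (1): C(t) grows without bound
% Premise (2): m(t) >= epsilon > 0, i.e. misalignment never vanishes
\begin{equation}
  \mathrm{Harm}(t) \;=\; C(t)\, m(t) \;\ge\; \varepsilon\, C(t)
  \;\longrightarrow\; \infty \quad \text{as } t \to \infty ,
\end{equation}
% so harm eventually exceeds any fixed catastrophic threshold.
```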
5) Catastrophic harm is an existential risk to humanity, and its main form is disempowerment, though extinction or eternal suffering are also possible outcomes.
Typically, the maximum possible harm is defined as existential risk, which could take the form of:
- human extinction,
- s-risk (eternal suffering, but some humans survive)
- disempowerment (some humans will survive but don’t control their fate).
Note that disempowerment is a prerequisite for any catastrophic harm, as it means that humans can’t resist bad things.
6) There are many ways in which AI can cause catastrophic harm (AI used as a tool, a Paperclipper, wrongly aligned AI, a war between AIs, AI halting).
It is not necessary that a (Singleton) AI will kill all humans, but it still can happen. An AI would kill humans for two reasons:
A) To prevent a threat to itself. But an AI can kill humans only after it develops human-independent robotic infrastructure (HIRI), presumably based on nanotech; and since humans can’t destroy nanotech, there is no threat-based reason to kill humans either before or after HIRI.
B) For some marginal utilitarian reasons: for their atoms, or as a side effect of ecological damage.
There are several reasons why the utility of preserving humans may turn out to be higher than the utility of their atoms: acausal trade with aliens, humans in simulations, and humans doing some heavy work. But the relation between these utilities is fragile and uncertain.
Lemma: Why alignment is hard
Definition: “Alignment” is a form of equivalence between two similar objects, e.g. parallel lines.
Definition: “AI alignment” is a form of equivalence between the human value system and the goal system of AI.
Alignment becomes more difficult when two objects are less similar, more remote, and less linear, e.g. a mouse and a cloud.
Humans and advanced AI are remote, dissimilar, and non-linear objects, and thus alignment is difficult:
a) Humans’ non-linearity. Human values are a fuzzy concept, more like a cloud (complexity of values).
b) AI’s quantitative distance from humans. As AI becomes superhuman, it will be more distant from humans (sharp left turn). It is difficult to align objects of different scales.
c) AI’s qualitative distance from humans. As AI thinks differently from humans, it could act differently in new situations.
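One way to put the lemma into symbols (a toy sketch using my own notation; the function g and the monotonicity assumptions are illustrative, not part of the original post):

```latex
% Toy formalization (my notation):
%   f   = fuzziness / non-linearity of human values            (point a)
%   d_q = quantitative (scale) distance between AI and humans  (point b)
%   d_k = qualitative (cognitive-style) distance               (point c)
% Assume misalignment m increases with each factor:
\begin{equation}
  m = g(f,\, d_q,\, d_k), \qquad
  \frac{\partial g}{\partial f} > 0, \quad
  \frac{\partial g}{\partial d_q} > 0, \quad
  \frac{\partial g}{\partial d_k} > 0 .
\end{equation}
% Under this assumption, making AI more capable (larger d_q) or more alien
% (larger d_k) increases misalignment even if the value specification (f)
% does not get worse -- which is the lemma's claim in symbols.
```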
A useful model here is companion pets.
Humans use many species of animals, but only a few of them as companions: dogs, horses, and a few more.
A human can build a team with a dog to pursue shared goals (e.g. hunting). But a human can’t build such a team with a snail.
In the value system of chimpanzees, it seems to be acceptable to bite off the fingers of someone who wronged them. In spite of our genetic closeness, chimpanzees are not suitable companions (but could probably be modified to become suitable).
Dogs are far more different from humans than chimpanzees are, but were selected for 50k+ years to be good companions, starting from an animal already mostly suitable for that due to an evolutionary coincidence. The result is a rather strong alignment of values, but still far from perfect (dogs sometimes kill humans).
I’m not aware of any companion animal that is not a mammal. Perhaps some falcons?
In general, animal companionship seems to require a sufficient biological similarity, a mind capable of mutual understanding, and at least some minimal alignment of values.
Falconry is a thing, but I wouldn’t necessarily consider falcons model companion birds.
Many birds do bond with us, though they might not be recommended companions due to complications of care. But the set of birds that have sufficient similarity, mutual understanding, and alignment of values to form a team likely includes, but is not limited to, crows, parrots, and pigeons. Pigeons, of course, have been livestock, but also messengers and pets. Many more kinds of birds have social inclinations that may cause them to interpret a human as a flock member, friend, parent, or someone to fall in love with (even if we don’t have any specific shared goals to team up on).
A good example with dogs. But here we aligned simpler minds via a very costly evolutionary process; this can’t be done with more complex minds.
This does look to me like a good formalization of the standard argument, which makes it possible to analyze the standard argument’s weaknesses.
The weak point here seems to be “Harm from AI is proportional to (capabilities) × (misalignment)”, because the argument seems to implicitly assume the usual strong definition of alignment: “Future AI systems will likely not be exactly aligned with human values”.
But, in reality, there are vital aspects of alignment (perhaps we should start calling them partial alignment), such as care about the well-being and freedom of humans, and only misalignment with those would cause harm (whereas some human values, such as the ones behind the widespread practice of factory farming, and many others, had better be skipped and not aligned to, because they would lead to disaster when combined with increased capabilities).
The Lemma does not apply to partial alignment.
It is true that we don’t know how to safely instill arbitrary values into advanced AI systems (and that might be a good thing, because arbitrary values can be chosen in such a way that they cause plenty of harm).
However, some values might be sufficiently invariant to be natural for some versions of AI systems. E.g., it might turn out that care about the “well-being and freedom of all sentient beings” is natural for some AI ecosystems (one can argue that for AI ecosystems which include persistent sentiences, the “well-being and freedom of all sentient beings” might become a natural value and goal).