The map of “Levels of defence” in AI safety

One of the main principles of engineering safety is multilevel defence. When a nuclear bomb was accidentally dropped over the US in 1961, three of its four safety mechanisms failed; the last one prevented a nuclear explosion: https://en.wikipedia.org/wiki/1961_Goldsboro_B-52_crash

Multilevel defence is used extensively in the nuclear industry and includes different systems of passive and active safety, ranging from the use of delayed neutrons for reactor control up to control rods, containment buildings and exclusion zones.

Here I present a view of AI safety from the perspective of multilevel defence. It is based mainly on two of my as-yet-unpublished articles: “Global and local solutions to AI safety” and “Catching treacherous turn: multilevel AI containment system”.

The special property of multilevel defence in the case of AI is that most of the protection comes from the first level alone, which is AI alignment. The other levels have progressively smaller chances of providing any protection, as the power of a self-improving AI will grow after it breaks through each successive level. So we might be tempted to ignore all levels after AI alignment, but, Houston, we have a problem: based on the current speed of AI development, it seems that powerful and dangerous AI could appear within several years, while AI safety theory needs several decades to be created.

The map is intended to demonstrate a general classification principle for the defence levels in AI safety, not to list all known ideas on the topic. The boxes marked in yellow are, to my understanding, part of MIRI's plan.

I have also added my personal probability estimates of whether each level will work (conditional on AI risk being the only global risk and on all previous levels having failed).
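Because each estimate is already conditional on the failure of all previous levels, the overall chance that at least one level holds is simply one minus the product of the conditional failure probabilities. A minimal sketch of this arithmetic, using placeholder numbers rather than the estimates from the map:

```python
# Illustrative sketch: how per-level conditional estimates combine into an
# overall chance that at least one defence level holds. The numbers below
# are hypothetical placeholders, not the estimates shown on the map.

def overall_protection(conditional_success):
    """Each entry is P(level works | all previous levels failed).
    The chance that every level fails is the product of the conditional
    failure probabilities; overall protection is the complement."""
    p_all_fail = 1.0
    for p in conditional_success:
        p_all_fail *= (1.0 - p)
    return 1.0 - p_all_fail

# Hypothetical estimates for four levels (alignment first, then fallbacks)
levels = [0.5, 0.1, 0.05, 0.01]
print(f"P(at least one level holds) = {overall_protection(levels):.3f}")
```

This also makes the earlier point concrete: with sharply declining conditional probabilities, the total is dominated by the first level, and the later levels add only a few percentage points.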

The map is constructed on the same principles as my “plan of x-risks prevention” map and my “immortality map”, which are also built around the idea of multilevel defence.

pdf: https://goo.gl/XH3WgK