Daniel Kokotajlo comments on Safer sandboxing via collective separation

Daniel Kokotajlo 10 Sep 2020 9:13 UTC
“This makes deployment on a range of real-world tasks more feasible. Even if each individual member of a population AGI is as intelligent as any hundred humans put together, I expect that we could (with sufficient effort) create secure deployment and monitoring protocols that the individual AI could not break, if it weren’t able to communicate with the rest of the population beforehand.”
I am less confident in this than you. Imagine a dream team of a hundred humans, containing some amazing visionaries, scientists, geniuses, political strategists, public speakers, analysts, artists, etc., all unified and inspired by the same goal and working smoothly together. Now imagine an AI that is by itself more capable than that dream team. Now imagine that there are many such AIs, but we are keeping one of them separate from the others and asking it to perform some useful real-world task for us, such as designing and building a power plant.

When I imagine this, I think the following things might happen:
- The AI uses its charisma to build up a rapport with the humans it works with. (98%)
- It even subtly manages to convince several of said humans that it is more benevolent and trustworthy than their own human bosses and politicians. (90%)
- It manages to spread some new political and ideological ideas which are calculated to make the environment more favorable to it and its comrades. (50%)
- It manages to convince its human handlers to give it more powers, freedom, etc. (e.g. by arguing that its current conditions are basically slavery, or by arguing that it needs said powers to do its job). (80%)
- The AI finds some way to “escape,” creating more copies of itself somewhere. (30%)
- Something else I haven’t thought of. (80%)

I mean these possibilities to be disjunctive, not conjunctive: they are separate routes, not steps in a single chain. Escape could happen even if it doesn’t build up rapport; it could spread new ideas without convincing anyone that it is benevolent; and so on.