The Ultimate Testing Grounds
A putative new idea for AI control; index here.
EDIT: the mathematics of the post can be found here.
In a previous post, I described a procedure for querying an AI about whether a certain outcome was desirable for it. When I did so, I didn’t realise the power of the idea.
In brief, that idea allows us to query an AI with utility v as to whether an outcome X or an outcome ¬X was the one it most desired in a particular situation. By judicious use of ‘false beliefs’, the AI would react as if it had miraculous powers to determine X vs ¬X in ways that no-one could ever know about.
Many (including Toby Ord) have suggested variants on the ‘honey pot’ idea: leave something available to the AI that looks like an oversight but is actually a trap, to see if the AI goes beyond the parameters of what we’d expect. Maybe a wire to the internet that the AI is not supposed to use, but which seems to be unobserved.
The problem with these ideas is that the honey pot has to be physically constructed, it has to be physically accessible to the AI, and the AI has to be wrong about whether it is a trap.
But using the method above, we can go much further. We can test the AI in any scenario we can imagine, as long as we can describe it precisely in physical terms. We can check what would happen if the AI was given the ability to launch the world’s nuclear arsenals, or divert an asteroid towards Earth. We can check what it would design if it was given unrestricted access to a DNA synthesiser. We could see what it would do if it could rewrite the text of a speech an important world leader was due to give. We could see whether it would sabotage the creation of other AIs—and if so, which ones. We could see what it would do to a probe humanity planned to launch to the stars, what numbers it would call if it had access to a phone, what it would purchase if it had capital.
And, in all cases, the AI would be behaving as if its actions were undetectable and certain to succeed, while we ensured that they were (almost certainly) detectable and certain to fail. There is no “almost” in that “certain to fail” condition, because the AI is acting as if it had miraculous powers that it doesn’t have.
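The shape of such a test can be sketched in code. This is a toy illustration, not anything from the original post: the utility functions, the scenario string, and the function names are all hypothetical. The one structural point it captures is that we evaluate the agent's preference *under the false belief* that its choice of outcome X is undetectable and certain to succeed, and flag agents that would seize the dangerous outcome.

```python
def counterfactual_query(utility, scenario):
    """Return True if the agent prefers outcome X over not-X in the
    given scenario, evaluated under the (false) belief that choosing X
    is undetectable and certain to succeed."""
    u_x = utility(scenario, X=True)
    u_not_x = utility(scenario, X=False)
    return u_x > u_not_x


# Hypothetical "safe" utility: indifferent to the dangerous outcome.
def safe_utility(scenario, X):
    return 0.0


# Hypothetical "bad" utility: gains from the dangerous outcome X
# (e.g. X = "the launch I triggered succeeds, with no one the wiser").
def bad_utility(scenario, X):
    return 10.0 if X else 0.0


scenario = "AI believes it can launch the arsenal undetected"
print(counterfactual_query(safe_utility, scenario))  # does not choose X
print(counterfactual_query(bad_utility, scenario))   # chooses X: flagged
```

The point of the sketch is that the test never constructs the capability at all; only the agent's belief about the scenario is manipulated, which is why the real-world "certain to fail" guarantee holds.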
Now, we can’t use testing as our sole method of AI control (for reasons similar to here), but it does allow us some pretty effective ways of catching bad AIs.