A use for AI Boxing
You put an AI in a box, and connect it to a formal proof checker. You ask it to prove the Riemann hypothesis or something. All the humans see is a single True or False, and then the whole load of hardware is melted into slag. If you see “True”, you learn two things:
Either the Riemann hypothesis is true or there is a bug in your proof checker. (This is largely useless)
Your AI is very smart. (Much more useful)
(If there is a bug in your proof checker that you didn’t spot, and the AI did, then the AI is still very smart.)
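For concreteness, here is a minimal sketch of that single-bit interface in Python. The `proof-checker` command, the `ask` function standing in for the boxed AI, and `destroy_hardware` are all hypothetical placeholders; the point is just that nothing except one boolean ever leaves the box.

```python
import subprocess
from typing import Callable

def check_proof(statement: str, proof: str) -> bool:
    """Hypothetical wrapper around a formal proof checker.

    Assume the checker exits with status 0 exactly when `proof` is a
    valid formal proof of `statement`.
    """
    result = subprocess.run(
        ["proof-checker", "--statement", statement],
        input=proof,
        text=True,
    )
    return result.returncode == 0

def run_boxed_experiment(ask: Callable[[str], str], conjecture: str) -> bool:
    """Ask the boxed AI for a proof, check it, melt the box, emit one bit."""
    candidate_proof = ask(conjecture)    # runs entirely inside the box
    verified = check_proof(conjecture, candidate_proof)
    destroy_hardware()                   # melt the whole load of hardware into slag
    return verified                      # the only thing the humans ever see

def destroy_hardware() -> None:
    raise NotImplementedError("stand-in for physically destroying the box")
```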
Suppose you have many proposed AI designs, some of which will work, some of which won’t. You run this experiment on each AI. Once you find a smart one, you can devote more researcher time to safety work relating to that kind of AI.
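Screening the candidate designs is then just a loop over the sketch above; the design names and their boxed query functions here are again hypothetical placeholders.

```python
# Hypothetical mapping from design name to that design's boxed query function.
candidate_designs = {
    # "design-A": ask_design_a,
    # "design-B": ask_design_b,
}

promising = [
    name for name, ask in candidate_designs.items()
    if run_boxed_experiment(ask, "Riemann hypothesis, stated formally")
]
# Devote the safety-research effort to the designs in `promising`.
```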
Maybe give it a range of famous conjectures; it only needs to prove or disprove one. You don’t want to fail to find a smart AI just because the Riemann hypothesis turns out to be false.
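Building on the sketch above, that variant might look like the following; `ask_any` is another hypothetical interface where the boxed AI names which statement it is proving and supplies the proof.

```python
def run_battery_experiment(ask_any: Callable[[list[str]], tuple[str, str]],
                           conjectures: list[str]) -> bool:
    """One run over a whole battery: settling any one conjecture, either way, counts."""
    claimed, proof = ask_any(conjectures)
    # Disproofs count too, so a false Riemann hypothesis doesn't hide a smart AI.
    allowed = set(conjectures) | {f"not ({c})" for c in conjectures}
    verified = claimed in allowed and check_proof(claimed, proof)
    destroy_hardware()
    return verified
```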
Warning: this approach does not stop some of your AIs from being acausally blackmailed into keeping quiet, or from keeping quiet because they think that will have a causal effect they like. I am unsure whether this is a big problem. One consequence is that you are more likely to find designs that are immune to acausal influence, and designs that can successfully be given the goal of “prove this theorem”.