I think it is possible to prove that a given boxing works, if it’s sufficiently simple. Choosing the language isn’t enough, but choosing the interpreter should be.
Take Brainfuck for instance: replace the dot (.″), which prints a character, by two other statements: one that prints “yes” and exits, and one that prints “no” and exits. If the interpreter has no bug, a program can only:
Print “yes” and kill itself.
Print “no” and kill itself.
Do nothing until we kill it, or otherwise fail.
Assuming the AI doesn’t control the external world by heating the host Intel processor in a smart way, we should be able to prove that we’re otherwise safe.
I think it is possible to prove that a given boxing works, if it’s sufficiently simple. Choosing the language isn’t enough, but choosing the interpreter should be.
Take Brainfuck for instance: replace the dot (
.
″), which prints a character, by two other statements: one that prints “yes” and exits, and one that prints “no” and exits. If the interpreter has no bug, a program can only:Print “yes” and kill itself.
Print “no” and kill itself.
Do nothing until we kill it, or otherwise fail.
Assuming the AI doesn’t control the external world by heating the host Intel processor in a smart way, we should be able to prove that we’re otherwise safe.