the sort of proof techniques I currently have in mind … would not work for verifying Friendliness of a finished AI that was handed you by a hostile superintelligence.
But what if the hostile superintelligence handed you a finished AI together with a purported proof of its Friendliness? Would you have enough trust in the soundness of your proof system to check the purported proof and act on the results of that check?
That would then be something you’d have to read and likely show to dozens of other people to verify reliably, leaving opportunities for all kinds of mindhacks. The OP’s proposal requires us to have an automatic verifier ready to run, one that can return a reliable answer without human intervention.
Yes, but the point is that the automatic verifier gets to check a proof that the AI-in-the-box produced; it doesn’t have to examine an arbitrary program and try to prove Friendliness from scratch.
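The underlying asymmetry is the familiar one between finding a certificate and checking it. As a toy analogy (not the Friendliness case itself), here is a minimal Python sketch: checking a purported satisfying assignment for a CNF formula is mechanical and fast, even though finding one may be intractable.

```python
# Sketch of the verify-vs-search asymmetry: checking a supplied
# certificate is mechanical, even when finding one is intractable.
# A CNF formula is a list of clauses; each clause is a list of
# signed variable indices (e.g. -2 means "NOT x2").

def check_assignment(cnf, assignment):
    """Return True iff `assignment` (dict var -> bool) satisfies `cnf`."""
    for clause in cnf:
        if not any(assignment[abs(lit)] == (lit > 0) for lit in clause):
            return False  # some clause has no true literal
    return True

# (x1 OR NOT x2) AND (x2 OR x3)
cnf = [[1, -2], [2, 3]]
print(check_assignment(cnf, {1: True, 2: True, 3: False}))  # True
```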
In a comment below, paulfchristiano makes the point that any theory of Friendliness at all should give us such a proof system, at least for some restricted class of programs. For example, Eliezer envisions a theory of how to let programs evolve without losing Friendliness. The corresponding class of proofs would have the form “the program under consideration can be derived from the known-friendly program X by the sequence Y of friendliness-preserving transformations”.
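As a minimal sketch of what a verifier for that proof class might look like, assuming a fixed whitelist of friendliness-preserving transformations (the transformations and the string-based program representation below are hypothetical stand-ins, not a real theory):

```python
# Hypothetical sketch: verify that a program is derived from a
# known-friendly base program X by a sequence Y of whitelisted,
# friendliness-preserving transformations.

# Whitelist: name -> transformation function (program -> program).
PRESERVING = {
    "inline_constant": lambda p: p.replace("TWO", "2"),
    "rename_var":      lambda p: p.replace("tmp", "scratch"),
}

def check_derivation(base_program, steps, claimed_program):
    """Replay `steps` (list of whitelist names) from `base_program`
    and confirm the result is exactly `claimed_program`."""
    prog = base_program
    for name in steps:
        if name not in PRESERVING:        # reject any non-whitelisted step
            return False
        prog = PRESERVING[name](prog)     # apply the transformation
    return prog == claimed_program

X = "out = tmp + TWO"
Y = ["inline_constant", "rename_var"]
print(check_derivation(X, Y, "out = scratch + 2"))  # True
```

The proof object here is just the pair (X, Y); the checker never has to search for a derivation, only replay the one it was handed.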
Actually, computers can mechanically check proofs in any formal system.
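For a concrete toy illustration: a minimal sketch of a mechanical proof checker, assuming a simplified system in which a step is either a premise or follows from earlier steps by modus ponens (the formula encoding and rule set are illustrative stand-ins, not a real Friendliness proof system):

```python
# Toy mechanical proof checker. Formulas are either an atom (a string)
# or an implication ("->", antecedent, consequent). Each proof step is
# (formula, justification); a justification is "premise" or
# ("mp", i, j), citing earlier step indices.

def check_proof(premises, proof, goal):
    derived = []
    for formula, just in proof:
        if just == "premise":
            if formula not in premises:
                return False
        else:
            kind, i, j = just
            if kind != "mp" or i >= len(derived) or j >= len(derived):
                return False
            # modus ponens: from A (step i) and A -> B (step j), conclude B
            if derived[j] != ("->", derived[i], formula):
                return False
        derived.append(formula)
    return bool(derived) and derived[-1] == goal

premises = ["A", ("->", "A", "B")]
proof = [
    ("A", "premise"),
    (("->", "A", "B"), "premise"),
    ("B", ("mp", 0, 1)),
]
print(check_proof(premises, proof, "B"))  # True
```

Every check the verifier makes is a finite syntactic comparison, which is why no human judgment is needed in the loop.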