Yes, but the point is that the automatic verifier gets to verify a proof that the AI-in-the-box produced—it doesn’t have to examine an arbitrary program and try to prove friendliness from scratch.
In a comment below, paulfchristiano makes the point that any theory of friendliness at all should give us such a proof system, for some restricted class of programs. For example, Eliezer envisions a theory of how to let programs evolve without losing friendliness. The corresponding class of proofs would have the form “the program under consideration can be derived from the known-friendly program X by the sequence Y of friendliness-preserving transformations”.
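To make the asymmetry concrete, here’s a minimal sketch (Python, with purely hypothetical transformation names) of what such a verifier might look like: it never analyzes an arbitrary program for friendliness; it only replays the claimed derivation from a known-friendly base through transformations the theory is assumed to certify as friendliness-preserving, and checks the result matches the candidate.

```python
# Toy illustration only -- all names and transformations are hypothetical.
from typing import Callable, Sequence

Program = str  # stand-in: a program is just its source text here

# Registry of transformations the (assumed) theory certifies as
# friendliness-preserving. A real system would attach proofs to each entry.
CERTIFIED_TRANSFORMS: dict[str, Callable[[Program], Program]] = {
    "inline_constant": lambda p: p.replace("X", "42"),
    "add_logging":     lambda p: p + "\nlog('step')",
}

def verify_derivation(known_friendly: Program,
                      steps: Sequence[str],
                      candidate: Program) -> bool:
    """Check the proof 'candidate = steps applied to known_friendly'.

    The verifier's job is cheap: apply each certified transformation in
    order and compare the result with the candidate program. Friendliness
    of the candidate then follows from friendliness of the base plus the
    (assumed) preservation property of each step.
    """
    current = known_friendly
    for name in steps:
        if name not in CERTIFIED_TRANSFORMS:
            return False  # proof cites a transformation the theory doesn't certify
        current = CERTIFIED_TRANSFORMS[name](current)
    return current == candidate

# The AI-in-the-box supplies (steps, candidate); the gatekeeper only replays them.
base = "compute X"
proof_steps = ["inline_constant", "add_logging"]
proposed = "compute 42\nlog('step')"
print(verify_derivation(base, proof_steps, proposed))  # True
```

The point of the sketch is the division of labor: all the hard search happens on the AI’s side when it constructs the derivation, while the verifier’s check is mechanical and doesn’t grow harder as the candidate program gets cleverer.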