The point is that it may be possible to design a heuristically friendly AI which, if friendly, will remain just as friendly after changing itself, without anyone having an infallible way to recognize a friendly AI. (In particular, it's bad if your screening has any false positives at all, since you have a transhuman actively searching for the pathologically bad cases.)
To recap, (b) was: ‘Verification of “proof of friendliness” is easier than its production’.
For that to work as a plan in context, the verification doesn't have to be infallible. It just needs to produce no false positives. False negatives are fine: if a good machine is rejected, that isn't the end of the world.
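To make the asymmetry in (b) concrete, here is a minimal sketch in Python, not from the dialogue itself, using integer factoring as a stand-in for "proof of friendliness": checking a claimed certificate is cheap, producing one takes search, and a gatekeeper that accepts only on a verified certificate has no false positives by construction. All names here (verify_certificate, produce_certificate) are hypothetical illustrations.

```python
def verify_certificate(n: int, factors: list[int]) -> bool:
    """Cheap check: do the claimed factors actually multiply to n?

    Rejects anything it cannot positively confirm, so it never accepts
    a bad certificate (no false positives). It may reject a perfectly
    good n that arrives without a valid certificate (a false negative),
    which the plan explicitly tolerates.
    """
    if factors is None or len(factors) < 2 or any(f < 2 for f in factors):
        return False
    product = 1
    for f in factors:
        product *= f
    return product == n


def produce_certificate(n: int) -> list[int] | None:
    """Expensive direction: trial division to *find* a factorization."""
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return [d, n // d]
    return None  # no certificate found: the gatekeeper must reject


n = 221
cert = produce_certificate(n)          # hard direction (search)
print(cert)                            # [13, 17]
print(verify_certificate(n, cert))     # easy direction: True
print(verify_certificate(n, [3, 74]))  # bad certificate: rejected
```

The design choice doing the work is that the verifier defaults to rejection whenever confirmation fails, which is exactly the "false negatives are fine" policy above.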
This is… unexpected. Mathematical theory, not to mention history, seems to be on my side here. How did you arrive at this conclusion?
Point taken. I was thinking of "unknown" in the sense of "designed with the aim of friendliness, but not yet proven friendly", but worded it badly.