I’d find a Friendliness proof fairly convincing if it were generated by an entity without strong incentives to let the AI out of the box (which, thanks to various possible incentive or blackmail schemes, probably limits the field to me and some subset of people who haven’t talked to the AI; ideally the latter, since I might not recognize all my own biases). If the entity in question is another AI, acausal trade issues unfortunately seem to arise if it’s sufficiently similar to the first one, so the obvious hack of running two instances of a boxed AI and asking them about each other isn’t going to be enough.
Purported Friendliness proofs coming from the AI itself… well, that’s a trickier issue. Essentially it seems to come down to estimating the complexity of the simplest fake proof consistent with the AI’s behavior and the complexity of the simplest real proof, then comparing both against the purported proof. All else being equal we should expect the fake to be larger than the real one (being subject to more constraints), so if we can put even approximate bounds on the fake’s actual complexity we might be able to validate the purported proof against that in a way that isn’t easily subject to spoofing.
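To make that a bit more concrete, here’s a minimal sketch of the kind of check I have in mind. True description complexity is uncomputable, so compressed length stands in as a crude proxy, and the two bounds (`real_proof_bound`, `fake_proof_bound`) are assumed external estimates rather than anything we currently know how to produce:

```python
import zlib

def complexity_proxy(text: str) -> int:
    """Crude stand-in for description complexity: length of the text
    after compression. Kolmogorov complexity itself is uncomputable,
    so any practical check has to settle for a proxy like this."""
    return len(zlib.compress(text.encode("utf-8"), level=9))

def plausibility_check(purported_proof: str,
                       real_proof_bound: int,
                       fake_proof_bound: int) -> str:
    """Compare the purported proof's proxy complexity against assumed,
    externally estimated bounds on the simplest real proof and the
    simplest behavior-consistent fake proof."""
    c = complexity_proxy(purported_proof)
    if c <= real_proof_bound:
        return "within the real-proof bound: weak evidence against spoofing"
    if c >= fake_proof_bound:
        return "in the fake-proof regime: treat as spoofed"
    return "between the bounds: no verdict"
```

The interesting case is the first branch: a purported proof simple enough to fall under the estimated floor for behavior-consistent fakes is the one this argument says we can’t easily be fooled by.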
Alternatively, if we have a good estimate of the AI’s intelligence (bounded by the box’s computational resources), it might be possible after repeated trials to say that the AI isn’t smart enough to spoof its captors in sufficiently subtle and varied ways. This would only work for weakly superhuman AI, though; for a really smart machine the probability of a successful spoof on any given trial would be close to one, so repeated trials wouldn’t change our probability estimate much.
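A quick Bayesian illustration of that last point, under the hypothetical assumption that a Friendly AI always passes a trial while an unFriendly one spoofs each trial independently with probability `p_spoof`:

```python
def posterior_friendly(n_passed: int, p_spoof: float, prior: float = 0.5) -> float:
    """Posterior that the boxed AI is Friendly after passing n_passed
    independent trials, assuming a Friendly AI always passes and an
    unFriendly one spoofs each trial with probability p_spoof."""
    p_pass_given_friendly = 1.0
    p_pass_given_unfriendly = p_spoof ** n_passed
    numerator = p_pass_given_friendly * prior
    return numerator / (numerator + p_pass_given_unfriendly * (1.0 - prior))

# Weakly superhuman AI (assumed 50% per-trial spoof chance): trials are informative.
print(posterior_friendly(20, p_spoof=0.5))   # ~0.999999
# Really smart machine (assumed 99% per-trial spoof chance): barely moves.
print(posterior_friendly(20, p_spoof=0.99))  # ~0.55
```

With a 50% per-trial spoof chance the posterior races to near-certainty after a couple dozen passes; at 99% it barely moves off the prior, which is exactly why the scheme only buys anything against weakly superhuman AI.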