as well as my point that if the AI is friendly then my existing proof of friendliness was apparently sufficient, and if the AI is unfriendly then it’s just a trick, so “a proof of my own friendliness” doesn’t seem like useful evidence.
Huh. I thought the AI Box experiment assumed that Friendliness is intrinsically unknown; that is, it’s not presumed the AI was designed according to Friendliness as a criterion.
If the AI is friendly, then the technique I am using already produces a friendly AI, and I thus learn nothing more than how to prove that it is friendly.
But if the AI is unfriendly, the proof will be subtly corrupt, so I can't actually count the proof as evidence of friendliness, since an FAI and a UFAI can both offer me exactly the same thing.