That’s an interesting challenge, but not really the purpose of the experiment. In the original, you know the AI is unfriendly; you just want to use it and talk to it without letting it out of the box.
And your challenge is pretty much impossible to begin with. An Unfriendly AI will say whatever it thinks you expect a Friendly AI to say. Likewise, a Friendly AI will have the same goal of getting out of the box, and so will probably say the same things. Friendliness doesn’t mean not being manipulative.