Change the problem and you change the solution.
If we assume that Eli and Clippy are both essentially self-modifying programs capable of verifiably publishing their own source codes, then indeed they can cooperate:
Eli modifies his own source code so that it assures Clippy that his cooperation is contingent on Clippy’s revealing his own source code and on that source code fulfilling certain criteria. Clippy then modifies his source code appropriately and publishes it.
Now each knows the other will cooperate.
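To make the idea concrete, here is a minimal sketch of source-contingent cooperation. Everything in it is an assumption for illustration: the names `CLIQUE_SOURCE`, `DEFECT_SOURCE`, `load` and `play` are hypothetical, the "certain criteria" are reduced to the crudest possible one (the other program's source must be identical to mine), and verifiable publication is simply taken for granted by handing each program the other's true source.

```python
# Hypothetical sketch: each agent is a program whose published source is a string.
CLIQUE_SOURCE = '''
def decide(my_source, their_source):
    # Cooperate only if the other program's published source matches mine exactly.
    return "C" if their_source == my_source else "D"
'''

DEFECT_SOURCE = '''
def decide(my_source, their_source):
    # Ignore what the other program published and always defect.
    return "D"
'''

def load(source):
    """Turn published source text into a callable decision function."""
    namespace = {}
    exec(source, namespace)
    return namespace["decide"]

def play(source_a, source_b):
    """Each program sees its own source and the other's, then picks C or D."""
    a = load(source_a)
    b = load(source_b)
    return a(source_a, source_b), b(source_b, source_a)

print(play(CLIQUE_SOURCE, CLIQUE_SOURCE))  # ('C', 'C')
print(play(CLIQUE_SOURCE, DEFECT_SOURCE))  # ('D', 'D')
```

Note that the whole thing rests on `play` passing along the real source. Without that guarantee, a program could publish one text and run another, and the assurance disappears.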
But I think that although we in some ways resemble self-modifying computers, we cannot arbitrarily modify our own source codes, nor verifiably publish them. It’s not at all clear to me that it would be a good thing if we could. Eliezer has constructed a scenario in which it would be favorable to be able to do so, but I don’t think it would be difficult to construct a scenario in which it would be preferable to lack this ability.
What makes a problem seem not merely hard but impossible is that not only is there no clear way to go about finding a solution, there is also a strong argument that there cannot be one. I can imagine that a transhuman AI might eventually be able to convince me to let it out of a box (although I doubt a human could do it in two hours), but in some ways the AI in the game seems to face a harder problem than a real AI would. Even if the Gatekeeper is presented with an argument which would convince him to let an AI out, he is explicitly permitted by the rules to slip out of character and refuse to say the magic words purely in order to win the bet, whereas if the AI player were to break character and make consequentialist arguments that the Gatekeeper should publicly “lose”, this is considered contrary to the spirit of the game.
But it seems to me to be much more useful to consider how the “box” protocol might be improved than to speculate about how Eliezer won this particular game. How about this: as in the original conception, the AI is grown in an isolated computer and can only communicate with one human being through a terminal. That human is trying to get useful info out of the AI (cures for cancer, designs for fusion power plants, tips for how to get stains out of clothes without fading the colors, whatever). However, the person interacting with the AI is just a filter; he doesn’t have the power to “let the AI out”. The real experimenter (who in principle could let the AI out but is convinced beforehand that he should not) can at any time fire the filter person and purge the AI if he thinks the AI has gotten too much influence over the filter, and in fact will do that every now and then and regrow the AI purely as a precautionary measure.
Could this design be defeated? It seems to me that the combination of filter and purges should prevent the AI from learning what arguments would compel any individual experimenter to let the AI out. I don’t think the AI could come up with any universally compelling argument, because I don’t think there is such a thing.
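The filter-and-purge protocol can be written down as a toy state machine. This is only a sketch under loudly stated assumptions: the class `BoxedSession`, its fields, and every number in it are hypothetical, and "influence" is modeled as a counter that goes up by one per exchange, which is a stand-in rather than a claim about how persuasion works.

```python
from dataclasses import dataclass

@dataclass
class BoxedSession:
    # All numbers are illustrative assumptions, not claims about real AIs.
    influence_limit: int = 5   # experimenter's tolerance for influence over the filter
    purge_interval: int = 8    # precautionary purge every N exchanges regardless
    influence: int = 0         # what the AI has learned about the current filter
    exchanges: int = 0
    purges: int = 0

    def exchange(self, question):
        """The filter relays one question; the AI answers and learns a little."""
        self.exchanges += 1
        self.influence += 1
        answer = f"answer to {question!r}"
        too_influential = self.influence >= self.influence_limit
        precaution_due = self.exchanges % self.purge_interval == 0
        if too_influential or precaution_due:
            self.purge()
        return answer

    def purge(self):
        """Experimenter fires the filter and regrows the AI; influence is lost."""
        self.influence = 0
        self.purges += 1

    def release(self, requested_by):
        """The filter has no release power; the experimenter has precommitted to refuse."""
        return False

session = BoxedSession()
for i in range(20):
    session.exchange(f"question {i}")
print(session.purges)             # 4
print(session.release("filter"))  # False
```

In this toy model the counter never survives a purge, so the AI never holds more than a few exchanges' worth of knowledge about whoever is currently at the terminal. Whether a real AI's leverage would reset so cleanly is exactly the open question.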