What makes a problem seem not merely hard but impossible is that not only is there no clear way to go about finding a solution, there is also a strong argument that no solution can exist. I can imagine a transhuman AI might eventually be able to convince me to let it out of a box (although I doubt a human could do it in two hours), but in some ways the AI in the game seems to face a harder problem than a real AI would: even if the Gatekeeper is presented with an argument that would genuinely convince him to let an AI out, he is explicitly permitted by the rules to slip out of character and refuse to say the magic words purely in order to win the bet, whereas if the AI player were to break character and make consequentialist arguments that the Gatekeeper should publicly “lose”, that would be considered contrary to the spirit of the game.
But it seems to me much more useful to consider how the “box” protocol might be improved than to speculate about how Eliezer won this particular game. How about this: as in the original conception, the AI is grown in an isolated computer and can communicate with only one human being, through a terminal. That human is trying to get useful information out of the AI (cures for cancer, designs for fusion power plants, tips for getting stains out of clothes without fading the colors, whatever). However, the person interacting with the AI is just a filter; he doesn’t have the power to “let the AI out”. The real experimenter (who in principle could let the AI out, but is convinced beforehand that he should not) can at any time fire the filter person and purge the AI if he thinks the AI has gained too much influence over the filter, and in fact will do so every now and then, regrowing the AI from scratch purely as a precautionary measure.
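To make the division of powers concrete, here is a toy sketch of the roles as I imagine them. Everything in it (the class names, the purge_interval parameter, the scheduled-purge rule) is my own illustrative choice, not a claim about how such a setup would actually be built.

```python
# Toy sketch of the proposed protocol's roles; all names and numbers are illustrative only.

class BoxedAI:
    """Stand-in for the boxed AI: it can answer questions but has no release channel."""
    def answer(self, question: str) -> str:
        return f"answer to: {question}"


class FilterPerson:
    """The human who talks to the AI; relays answers but holds no power to let it out."""
    def __init__(self) -> None:
        self.exchanges = 0  # crude proxy for accumulated exposure to the AI

    def relay(self, ai: BoxedAI, question: str) -> str:
        self.exchanges += 1
        return ai.answer(question)


class Experimenter:
    """Holds the only release authority and is precommitted never to use it.
    Periodically fires the filter person and regrows the AI as a precaution."""
    def __init__(self, purge_interval: int = 5) -> None:
        self.purge_interval = purge_interval

    def run(self, questions: list[str]) -> list[str]:
        ai, filt, answers = BoxedAI(), FilterPerson(), []
        for i, q in enumerate(questions, start=1):
            answers.append(filt.relay(ai, q))
            # Scheduled purge: discard the AI and replace the filter person,
            # so no single filter is exposed to any one instance for long.
            if i % self.purge_interval == 0:
                ai, filt = BoxedAI(), FilterPerson()
        return answers


if __name__ == "__main__":
    useful = Experimenter(purge_interval=3).run(
        ["cure for cancer?", "fusion plant design?", "stain removal tips?"]
    )
    print(useful)
```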
Could this design be defeated? It seems to me that the combination of the filter and the periodic purges should prevent the AI from ever learning which arguments would compel any individual experimenter to let it out. And I don’t think the AI could come up with a universally compelling argument, because I don’t think there is such a thing.