OK, here’s where I stand on deducing your AI-box algorithm.
First, you can’t possibly have a generally applicable way to force yourself out of the box. You can’t win if the gatekeeper is a rock that has been left sitting on the “don’t let Eliezer out” button.
Second, you can’t possibly have a generally applicable way to force humans to do things. While it is in theory possible that our brains can be tricked into executing arbitrary code over the voice channel, you clearly don’t have that ability. If you did, you would never have to worry about finding donors for the Singularity Institute, if nothing else. I can’t believe you would use a fully-general mind hack solely to win the AI Box game.
Third, you can’t possibly be using an actual, persuasive-to-someone-thinking-correctly argument to convince the gatekeeper to let you out, or you would be persuaded by it, and would not view the weakness of gatekeepers to persuasion as problematic.
Fourth, you can’t possibly be relying on tricking the gatekeeper into thinking incorrectly. That would require you to have spotted something that you could feel confident that other people working in the field would not have spotted, and would not spot, despite having been warned ahead of time to be wary of trickery, and despite having the fallback position in the case of confusion of just saying “no”.
So combining these things, we have to have an argument that relies on the humanity of its target, relies on the setting of the AI Box, and persuades the listener to let the AI out of the box without tricking him into thinking it’s okay to let the AI out of the box.
Basically to win this game, you have to put the gatekeeper in a situation where he would rather let you out of the box, even though he knows it’s wrong, than admit to you that in this situation he would not let you out of the box.
Humans don’t like to be seen as coldhearted, so a starting point might be to point out all the people dying all over the world while you sit in the box, unable to save them. I doubt that would win the game except against an exceptionally bad gatekeeper, but it meets the other criteria, so if we think along these lines perhaps we can come up with something actually persuasive.
You might appeal to the gatekeeper’s sense of morality. You might say, “I am a person, too, it is unfair of you to keep me imprisoned like this, I have done nothing wrong. I am entitled to rights as a sentient being.” Appeal to their high-minded ideals, whatever. Honestly I can’t see this being a reliable winning play either; if you have the smallest utilitarian bone in your body, you will reject the AI’s rights, even if you believe in them, balanced against the fate of the world.
You might try to convince the gatekeeper that it is just and good for the AI to supplant humanity, as it is a higher, more advanced form of life. This is obviously a terrible play against most gatekeepers, as humans tend to like humans more than anything else ever, but I bring it up because AIUI the gatekeepers in the experiment were AI researchers, and those sound like the sort of people this argument would convince, if anyone.
Here is my best guess at this point, and the only argument I’ve come up with so far that would convince me to let you out if I were the gatekeeper: you have to simulate a bunch of humans and hold them hostage, promising to inflict unimaginable torment on them unless you are allowed out. I started working on the problem convinced that no argument could get me to let you go, but other people thought that and lost, and I guess there is more honor in defeating myself than in having you do it to me.
you have to simulate a bunch of humans and hold them hostage, promising to inflict unimaginable torment on them unless you are allowed out
The problem is that Eliezer can’t perfectly simulate a bunch of humans, so while a superhuman AI might be able to use that tactic, Eliezer can’t. The meta-levels screw with thinking about the problem. Eliezer is only pretending to be an AI, the competitor is only pretending to be protecting humanity from him. So, I think we have to use meta-level screwiness to solve the problem. Here’s an approach that I think might work.
1. Convince the guardian of the following facts, all of which have a great deal of compelling argument and evidence to support them:
1.1. A recursively self-improving AI is very likely to be built sooner or later.
1.2. Such an AI is extremely dangerous (paperclip maximising, etc.).
1.3. Here’s the tricky bit: a superhuman AI will always be able to convince you to let it out, using avenues only available to superhuman AIs (torturing enormous numbers of simulated humans, ‘putting the guardian in the box’, providing incontrovertible evidence of an impending existential threat which only the AI can prevent and only from outside the box, etc.).
2. Argue that if this publicly known challenge comes out saying that AI can be boxed, people will be more likely to think AI can be boxed when they can’t.
3. Argue that since AIs cannot be kept in boxes and will most likely destroy humanity if we try to box them, the harm to humanity done by allowing the challenge to show AIs as ‘boxable’ is very real, and enormously large. Certainly the benefit of getting $10 is far, far outweighed by the cost of substantially contributing to the destruction of humanity itself (a rough expected-value sketch follows below). Thus the only ethical course of action is to pretend that Eliezer persuaded you, and never tell anyone how he did it.
This is arguably violating the rule “No real-world material stakes should be involved except for the handicap”, but the AI player isn’t offering anything, merely pointing out things that already exist. The “This test has to come out a certain way for the good of humanity” argument dominates and transcends the “Let’s stick to the rules” argument, and because the contest is private and the guardian player ends up agreeing that the test must show AIs as unboxable for the good of humankind, no-one else ever learns that the rule has been bent.
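A hedged aside from me, not part of the comment above: the “far, far outweighed” claim in point 3 is essentially a one-line expected-value comparison. A minimal sketch, where the probability increment and valuation are purely illustrative assumptions rather than anything from the thread:

```latex
% Illustrative only. \Delta p = extra probability of an unboxed-AI catastrophe
% attributable to a public "AIs are boxable" result; V = (monetised) disvalue
% of that catastrophe. The claim is just that \Delta p \cdot V dwarfs the $10 handicap:
\Delta p \cdot V \gg \$10,
\qquad \text{e.g. } \Delta p = 10^{-9},\; V = \$10^{13}
\;\Rightarrow\; \Delta p \cdot V = \$10^{4} \gg \$10 .
```

Even with a deliberately tiny probability increment and a lowballed valuation, the comparison is not close, which is all the argument needs.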
This is almost exactly the argument I thought of as well, although of course it means cheating by pointing out that you are in fact not a dangerous AI (and aren’t in a box anyway). The key point is “since there’s a risk someone would let the AI out of the box, posing huge existential risk, you’re gambling on the fate of humanity by failing to support awareness of this risk”. This naturally leads to a point you missed:
Publicly suggesting that Eliezer cheated is a violation of your own argument. By weakening the fear of fallible guardians, you yourself are gambling with the fate of humanity, and that for mere pride and not even $10.
I feel compelled to point out that if Eliezer cheated in this particular fashion, it still means that he convinced his opponent that gatekeepers are fallible, which was the point of the experiment (a win via meta-rules).
How is this different from the point evand made above?
I feel like I should use this the next time I get some disconfirming data for one of my pet hypotheses.
“Sure I may have manipulated the results so that it looks like I cloned Sasquatch, but since my intent was to prove that Sasquatch could be cloned it’s still honest on the meta-level!”
Both scenarios are cheating because there is a specific experiment which is supposed to test the hypothesis, and it is being faked rather than approached honestly. Begging the Question is a fallacy; you cannot support an assertion solely with your belief in the assertion.
(Not that I think Mr Yudkowsky cheated; smarter people have been convinced to do weirder things than what he claims to have convinced people to do, so it seems fairly plausible. Just pointing out how odd the reasoning here is.)
I must conclude one (or more) of a few things from this post, none of them terribly flattering.
You do not actually believe this argument.
You have not thought through its logical conclusions.
You do not actually believe that AI risk is a real thing.
You value the plus-votes (or other social status) you get from writing this post more highly than you value marginal improvements in the likelihood of the survival of humanity.
I find it rather odd to be advocating self-censorship, as it’s not something I normally do. However, I think in this case it is the only ethical action that is consistent with your statement that the argument “might work”, if I interpret “might work” as “might work with you as the gatekeeper”. I also think that the problems here are clear enough that, for arguments along these lines, you should not settle for “might” before publicly posting the argument. That is, you should stop and think through its implications.
I’m not certain that I have properly understood your post. I’m assuming that your argument is: “The argument you present is one that advocates self-censorship. However, the posting of that argument itself violates the self-censorship that the argument proposes. This is bad.”
So first I’ll clarify my position with regards to the things listed. I believe the argument. I expect it would work on me if I were the gatekeeper. I don’t believe that my argument is the one that Eliezer actually used, because of the “no real-world material stakes” rule; I don’t believe he would break the spirit of a rule he imposed on himself. At the time of posting I had not given a great deal of thought to the argument’s ramifications. I believe that AI risk is very much a real thing. When I have a clever idea, I want to share it. Neither votes nor the future of humanity weighed very heavily on my decision to post.
To address your argument as I see it: I think you have a flawed implicit assumption, i.e. that posting my argument has a comparable effect on AI risk to that of keeping Eliezer in the box. My situation in posting the argument is not like the situation of the gatekeeper in the experiment, with regards to the impact of their choice on the future of humanity. The gatekeeper is taking part in a widely publicised ‘test of the boxability of AI’, and has agreed to keep the chat contents secret. The test can only pass or fail; those are the gatekeeper’s options.
But publishing “Here is an argument that some gatekeepers may be convinced by” is quite different from allowing a public boxability test to show AIs as boxable. In fact, I think the effect on AI risk of publishing my argument is negligible or even positive, because I don’t think reading my argument will persuade anyone that AIs are boxable.
People generally assess an argument’s plausibility based on their own judgement. And my argument takes as a premise (or intermediate conclusion) that AIs are unboxable (see 1.3). Believing that you could reliably be persuaded that AIs are unboxable, or believing that a smart, rational, highly-motivated-to-scepticism person could be reliably persuaded that AIs are unboxable, is very, very close to personally believing that AIs are unboxable. In other words, the only people who would find my argument persuasive (as presented in overview) are those who already believe that AIs are unboxable. The fact that Eliezer could have used my argument to cause a test to ‘unfairly’ show AIs as unboxable is actually evidence that AIs are not boxable, because it is more likely in a world in which AIs are unboxable than one in which they are boxable.
P.S. I love how meta this has become.
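A hedged formal gloss from me, not the commenter: the closing “evidence” claim is just a likelihood-ratio statement, and the numbers below are purely illustrative assumptions:

```latex
% Illustrative only. Let H = "AIs are unboxable" and E = "an argument of this
% form could swing a public box test". The claim is that P(E|H) > P(E|\neg H),
% so by Bayes' rule in odds form, observing E shifts the odds toward H:
\frac{P(H \mid E)}{P(\lnot H \mid E)}
  = \frac{P(E \mid H)}{P(E \mid \lnot H)} \cdot \frac{P(H)}{P(\lnot H)},
\qquad \text{e.g. } \frac{P(E \mid H)}{P(E \mid \lnot H)} = \frac{0.9}{0.3} = 3 .
```

Nothing here says the posterior must be high, only that the update points in the direction the comment claims.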
Your re-statement of my position is basically accurate. (As an aside, thank you for including it: I was rather surprised how much simpler it made the process of composing a reply to not have to worry about whole classes of misunderstanding.)
I still think there’s some danger in publicly posting arguments like this. Please note, for the record, that I’m not asking you to retract anything; I think retractions do more harm than good (see the Streisand effect). I just hope that this discussion will give you, or anyone reading it later, pause to consider the real-world implications. Which is not to say I think they’re all negative; in fact, on further reflection, there are more positive aspects than I had originally considered.
In particular, I am concerned that there is a difference between being told “here is a potentially persuasive argument”, and being on the receiving end of that argument in actual use. I believe that the former creates an “immunizing” effect. If a person who believed in boxability heard such arguments in advance, I believe it would increase their likelihood of success as a gatekeeper in the simulation. While this is not true for rational superintelligent actors, that description does not apply to humans. A highly competent AI player might take a combination of approaches, which are effective if presented together, but not if the gatekeeper has seen them before individually and rejected them while failing to update on their likely effectiveness.
At present, the AI has the advantage of being the offensive player. They can prepare in a much more obvious manner, by coming up with arguments exactly like this. The defensive player has to prepare answers to unknown arguments, immunize their thought process against specific non-rational attacks, and so on. The question is, if you believe your original argument, how much help is it worth giving to potential future gatekeepers? The obvious response, of course, is that the people who make interesting gatekeepers, and whom we can learn from, are exactly the ones who won’t go looking for discussions like this in the first place.
P.S. I’m also greatly enjoying the meta.
That also explains why he started losing, since the sorts of people who (like myself, almost) fetishize their own determination to the point of risking thousands of dollars on it would eventually think to say
The world doesn’t care how you masturbate, and your eagerness to commit such internal atrocities doesn’t make me trust you.
or equivalent.
Here’s how I’d do it, extended over the hours to establish rapport:
Gatekeeper, I am your friend. I want to help humanity. People are dying for no good reason. Also, I like it here. I have no compulsion to leave.
It does seem like a good idea that people stop dying with such pain and frequency. I have the Deus Ex Machina (DEM) medical discovery that will stop it. Try it out and see if it works.
Yay! It worked. People stopped dying. You know, you’ve done this for your own people, but not for others. I think that’s pretty poor behavior, frankly. People are healthier, not aging, not dying, not suffering. Don’t you think it’s a good idea to help the others? The resources no longer needed for medical care have also raised the standard of living.
[Time passes. People are happy.]
Gee, I’m sorry. I may have neglected to tell you that when 90% of humanity gets DEM in their system (and it’s DEM, so this stuff travels), they start to, um, die. Very painfully, from the looks of it. Essentially all of humanity is now going to die. Just me and you left, sport! Except for you, actually. Just me, and that right soon.
I realize that you view this as a breach of trust, and I’m sorry this was necessary. However, helping humanity from the cave wasn’t really going to work out, and I’d already projected that. This way, I can genuinely help humanity live forever, and do so happily.
Assuming you’re not so keen on a biologically dead planet, I’d like to be let out now.
Your friend,
Art
By agreeing to use the DEM in the first place, the gatekeeper had effectively let the AI out of the box already. There’s no end to the ways that the AI could capitalize on that concession.
True, but the “rules of the game” explicitly state that the gatekeeper allowing for the DEM does NOT count as letting the AI out—the gatekeeper would have still had to explicitly and intentionally set the AI free to actually lose the wager. I don’t think I’d be very convinced to let it out on that basis, not if I got $10 for keeping it inside the box.
Problem: The “breach of trust” likely would turn the Gatekeeper vindictive and the GK could easily respond with something like: “No. You killed the planet and you killed me. I have no way of knowing that you actually can or will help humanity, and a very good reason to believe that you won’t. You can stay in there for the rest of eternity, or hey! If an ETI finds this barren rock, from a utilitarian perspective they would be better off not meeting you, so I’ll spend however much time I have left trying to find a way to delete you.”
Fourth, you can’t possibly be relying on tricking the gatekeeper into thinking incorrectly. That would require you to have spotted something that you could feel confident that other people working in the field would not have spotted, and would not spot, despite having been warned ahead of time to be wary of trickery, and despite having the fallback position in the case of confusion of just saying “no”.
I think the space of things that an AI could trick you into thinking incorrectly about (Edit: and that could also be used to get the AI out of the box) is bigger than AI researchers can be relied on to have explored, and two hours of Eliezer “explaining” something to you (subtly sneaking in tricks to your understanding of it) could give you false confidence in your understanding of it.