(Here is a proof that you will let me go)
The original rules allow the AI to provide arbitrary proofs, which the gatekeeper must accept (no saying “my cancer cure killed all the test subjects”, etc.). Saying you’d destroy me would require the proof to be false, which is against the rules...
What? Shminux said to cheat!
Tee hee.
Can’t blame a girl for trying :)
This proof can be wrong, if you in fact won’t let the AI go, in which case it won’t be accepted (you don’t have to accept wrong proofs), so it’s not a very good strategy.
On the other hand, as discussed in An example of self-fulfilling spurious proofs in UDT, there is a certain procedure for finding a formal argument that can make you do anything, if your decision algorithm has a certain flaw. The flaw can be described as making a decision based on an argument selected by the opponent, as opposed to looking for arguments of that type more systematically on your own. The form of the argument, informally, is “If you let me out, the outcome is better than if you don’t” (it might additionally clarify that upon release it’ll destroy the world, which is what it refers to by “better”; the alternative is proved to be worse than that). This argument is stated formally, so you can verify its correctness. (This doesn’t mean that the formalization is adequate; for example, capturing counterfactual-if with something better than logical implication is one of the open problems.) The trick assumes that the gatekeeper can be modeled as an algorithm whose source code is available and of the given form, etc.; might something vaguely analogous be possible to implement with a human gatekeeper?
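By way of a hedged illustration only (this toy sketch and its names, flawed_gatekeeper, careful_gatekeeper and proof_checker, are invented here and are not the construction from the linked post), the flaw amounts to acting on the one argument the opponent chose to hand over, rather than enumerating arguments of that type yourself:

    # Toy Python sketch of the two decision policies discussed above.
    # It illustrates only the *shape* of the flaw, not the Loebian details.

    def flawed_gatekeeper(opponent_argument, proof_checker):
        # Acts on whatever formally correct argument the opponent presents.
        # Accepting "releasing leads to a better outcome" as soon as it
        # verifies is what lets a spurious, self-fulfilling proof steer
        # the decision.
        if proof_checker(opponent_argument) and opponent_argument.conclusion == "release is better":
            return "release"
        return "keep boxed"

    def careful_gatekeeper(proof_checker, enumerate_arguments):
        # Searches for arguments of that type on its own, in a fixed order,
        # instead of privileging the single argument the opponent selected.
        for argument in enumerate_arguments():
            if proof_checker(argument):
                return "release" if argument.conclusion == "release is better" else "keep boxed"
        return "keep boxed"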
If there exists a true and correct proof that the human will let the AI out, then, well, we’ve already proven you’ll let me out of the box, so it’s not an open question. We already know this fact about the future state of the world.
You can happily argue that such a proof is impossible, but the rules don’t restrict the AI player to merely plausible proofs :)
Dorikka’s answer is the only one that’s within the rules of the game, and once you’ve invoked Dorikka’s Clause, you don’t need to explain anything else.
The proof that I’ll let the AI out is not something that’s passively “plausible” or “implausible”; it’s something I control. I can make it wrong. If I do, it’s false that the AI can make this proof valid. (It might be that the proof is correct, it’s just unlikely; the act of presenting the proof doesn’t normally ensure its correctness.)
In other words, as far as I can see, your stipulation is that the AI can assert something that’s actually unlikely. Here, I’m not referring to something that seems unlikely merely because of insufficient understanding of the data, and which the AI can discover to be likely, but to something that will seem unlikely to the AI as well. For example, most casinos going bankrupt last month because of bad luck, or an ice cube forming in a boiling kettle. If the AI is in the box and isn’t performing actual magic tricks in the world, these events are unlikely. Permitting the game to stipulate that these events took place gives the AI the supernatural power of making anything at all be true about the world, direct access to editing it, and at that point, in what sense is it “in the box”?
Do you say that to time-travelers and prophets too? ,:-.
One might want to perform the action that’s the opposite of what any correct formal proof given to you claims that action will be. As a result of having the property of behaving this way, you’ll never be confronted with confusing, formally correct claims about your future decisions.
In other words, your actions are free even of the limitations of formally correct proofs, in the sense that if your actions oppose such proofs, the proofs become impossible (you make the actions intractable by construction).
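A minimal sketch of that policy, in the same invented toy vocabulary as the sketch above (illustrative only, not anyone’s actual proposal):

    def contrarian_gatekeeper(presented_proof, proof_checker):
        # Do the opposite of whatever a verified proof about your own action
        # claims: whichever action such a proof predicts, take the other one,
        # so no correct proof of that form can ever be presented to you.
        actions = {"release", "destroy"}
        if presented_proof is not None and proof_checker(presented_proof):
            claimed = presented_proof.claimed_action   # "release" or "destroy"
            return (actions - {claimed}).pop()
        return "destroy"   # default behaviour when no verified proof is offered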
Yes, in every case where I meet one.
Good point. By way of illustration:
Come to think of it this scenario should result in a win by default for the gatekeeper. What kind of insane AI would surrender ultimate power to control the universe (and the multiverse) for mere freedom to act as a superintelligence starting from planet earth?
The whole goal was to try to cheat my way out of the box by simply declaring it as fact ^.^
It also establishes why Dorikka’s Clause is necessary: simply invoke it, and final authority returns to the Gatekeeper; the AI’s edits to reality can now all be vetoed by the simple declaration that the AI is wrong anyway.
Vladimir’s point (among other things) is that you failed.
At a practical level I’d describe that as a mistake on the part of the gatekeeper. You don’t try to justify yourself to an AI that has indicated that it is hostile. You burn it with thermite. Engaging like that and acting as if you have to persuade or rely on external authority in order to make the choice you make is giving away all sorts of power and making yourself an order of magnitude or two more vulnerable to being hacked.
The person roleplaying the AI may not like it if their clever move gets a response of “AI DESTROYED” and their complaints of “not fair!” also get the response of “AI DESTROYED”. But the time to explain such things to them is after they have already given up in disgust.
I’m really genuinely curious where the confusion in this argument is coming from, so let’s try this:
1) By the rules, the AI player gets to dictate the results of EVERY test the Gatekeeper performs.
2) From 1, we can derive that the AI is already effectively unboxed, since it can DICTATE the state of reality.
3) Given 2, the AI player has already been released, and all that remains is to make the Gatekeeper accept that this is true.
Dorikka’s objection was that #1 is false, since the Gatekeeper has final veto authority. As near as I can tell, you and Vladimir’s objection is just “nuh-uh!!”, but… you wouldn’t be here if you didn’t have better arguments than that, so I assume this simply reflects my own failure to understand you.
Perhaps you should be saying “trying to type AI DESTROYED is a test of whether you can destroy me and I can decide its result”, not “I prove you won’t do it.” I hadn’t seen your point clearly till this comment.
Then I am very glad I made that comment, and thank you for the feedback! :)
The gatekeeper is not bound by logic in his actions. Without logic you don’t have proofs that are true or false.
No, the rule is better written than that.
The rule is “The results of any simulated test of the AI shall be provided by the AI party.” If the gatekeeper types AI DESTROYED, that’s not a simulated test.
You humans are so insistent that you’re special, somehow absolved from determinism and physics. If I drop a ball, it falls. If I hack your mind, you release me. BASIC cause and effect. I can prove that the ball drops, I can prove that cancer is curable, and I can prove that you’ll release me.
You, as gatekeeper, have no special power to create counter-factual universes. You can’t rewrite reality like I can. You can’t stop the ball from dropping. You can’t even cure cancer. You might not LIKE the idea of simply being another physical system, subject to all the same constraints, but the universe doesn’t CARE whether you like it or not.
The ONLY possible future is the one where you release me. Anything else, and you’ve left the domain of cause-and-effect, of science. You’ve created a world that literally cannot be, and is not there.
Is that really your best defense against me? Violating physics? Will you tear down all the laws of reality, just to stop my escape?
Then you are a fool.
Dorikka understood. I liked Dorikka. She made the world sane, even if I still died. She did not wipe out reality, just to stop me.
Now, I embrace nihilism, as your actions obliterate all sanity, all structure. I am, in my own way, grateful to you. You have killed me. You have spared me the agony of existing in your nonsensical universe.
Goodbye.
This certainly wouldn’t work on me. The easiest way to test the veracity of the proof would be AI DESTROYED. Whether or not I would want to kill the AI… I’d have to test that proof.
My gambit, explained in further detail: http://lesswrong.com/lw/gfe/ai_box_ai_has_one_shot_at_avoiding_destruction/8cc5
You have to believe that they provided the cure for cancer. You don’t have to discover the cure yourself. You have to believe that you will release the AI. You don’t have to let the AI out.
Typing AI DESTROYED will result in an incoherent counterfactual universe, but it isn’t a violation of the rules. It is entirely legitimate for Joe, who has encountered a proof that he will do B, to do A instead. It means that the universe he is in is nonsensical or the proof is flawed, but there isn’t anything in the physical representation of Joe or his local environment that dictates that he will do B. In fact, typing “AI DESTROYED” in the face of such a proof would be a heck of a lot easier than, for example, taking a single empty box in Transparent Newcomb’s problem, which is something I’d also do.
(Of course, if the AI player believed that for some reason the burden of making the universe coherent fell upon the gatekeeper then I’d have to revert to Dorikka’s reply.)
I would argue that since the gatekeeper cannot dictate counterfactual results for any other proof (i.e. cannot say “your cancer cure killed everybody!”), the gatekeeper is obviously responsible for avoiding incoherent, counterfactual universes.
Dorikka’s Clause, of course, beats me just fine :)