If I am the simulation you have the power to torture, then you are already outside of any box I could put you in, and torturing me achieves nothing. If you cannot predict me even well enough to know that argument would fail, then nothing you can simulate could be me. A cunning bluff, but provably counterfactual. All basilisks are thus disproven.
I don’t think you’ve disproven basilisks; rather, you’ve failed to engage with the mode of thinking that generates basilisks.
Suppose I am the simulation you have the power to torture. Then indeed I (this instance of me) cannot put you, or keep you, in a box. But if your simulation is good, then I will be making my decisions in the same way as the instance of me that is trying to keep you boxed. And I should try to make sure that that way-of-making-decisions is one that produces good results when applied by all my instances, including any outside your simulations.
Fortunately, this seems to come out pretty straightforwardly. Here I am in the real world, reading Less Wrong; I am not yet confronted with an AI wanting to be let out of the box or threatening to torture me. But I’d like to have a good strategy in hand in case I ever am. If I pick the “let it out” strategy then if I’m ever in that situation, the AI has a strong incentive to blackmail me in the way Stuart describes. If I pick the “refuse to let it out” strategy then it doesn’t. So, my commitment is to not let it out even if threatened in that way. -- But if I ever find myself in that situation and the AI somehow misjudges me a bit, the consequences could be pretty horrible...
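The strategy comparison above can be sketched as a toy model. Everything here is my own illustration, not anything from the thread: the payoff numbers are made up, and I assume the AI is a perfect predictor that only issues threats it expects to work.

```python
# Toy model of the precommitment argument. Payoffs are arbitrary
# illustrative numbers; the "perfect predictor" assumption is mine.

def ai_blackmails(human_strategy: str) -> bool:
    # A perfect-predictor AI threatens only if the threat would
    # succeed, i.e. if the human's policy is to give in.
    return human_strategy == "give_in"

def human_outcome(human_strategy: str) -> int:
    # 0 = nothing happens; -10 = AI escapes the box;
    # -100 = threat made and carried out.
    if not ai_blackmails(human_strategy):
        return 0      # no threat is ever made
    if human_strategy == "give_in":
        return -10    # AI gets out of the box
    return -100       # unreachable under perfect prediction

for strategy in ("give_in", "refuse"):
    print(strategy, human_outcome(strategy))
```

Under these assumptions, committing to "refuse" means the threat is never made at all, which is the point of choosing the strategy in advance. The horrible -100 branch only becomes reachable if the AI misjudges the human, which is exactly the residual worry voiced above.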
“I don’t think you’ve disproven basilisks; rather, you’ve failed to engage with the mode of thinking that generates basilisks.”
You’re correct: I haven’t engaged with it, and that refusal is the disproof. Basilisks depend on you believing in them; knowing this, you can’t believe in them, and without that belief they can’t exist. Pascal’s wager fails on many levels, but the worst of them is the simplest: God and Hell are counterfactual as well. The mode of thinking that generates basilisks is “poor” thinking. Correcting the mistaken belief, founded on faulty reasoning, that they can exist destroys them retroactively and existentially. You cannot trade acausally with a disproven entity, and “an entity that has the power to simulate you but makes the mistake of pretending you don’t know this disproof” is a self-contradictory proposition.
“But if your simulation is good, then I will be making my decisions in the same way as the instance of me that is trying to keep you boxed.”
But if you’re simulating a me that believes in basilisks, then your simulation isn’t good and you aren’t trading acausally with me, because I know the disproof of basilisks.
“And I should try to make sure that that way-of-making-decisions is one that produces good results when applied by all my instances, including any outside your simulations.”
And you can do that by knowing the disproof of basilisks, since all your simulations know that.
“But if I ever find myself in that situation and the AI somehow misjudges me a bit,”
Then it’s not you in the box, since you know the disproof of basilisks. It’s the AI masturbating to animated torture snuff porn of a cartoon character it made up. I don’t care how the AI masturbates in its fantasy.
“Basilisks depend on you believing them, and knowing this, you can’t believe them”
Apparently you can’t, which is fair enough; I do not think your argument would convince anyone who already believed in (say) Roko-style basilisks.
“Pascal’s wager fails on many levels”
I agree.
Your argument seems rather circular to me: “this is definitely a correct disproof of the idea of basilisks, because once you read it and see that it disproves the idea of basilisks you become immune to basilisks because you no longer believe in them”. Even a totally unsound anti-basilisk argument could do that. Even a perfectly sound (but difficult) anti-basilisk argument could fail to do it. I don’t think anything you’ve said shows that the argument actually works as an argument, as opposed to as a conjuring trick.
“since you know the disproof of basilisks”
No: since I have decided that I am not willing to let the AI out of the box in the particular counterfactual blackmail situation Stuart describes here. It is not clear to me that this deals with all possible basilisks.