Exploiting Newcomb’s Game Show
Suppose you regularly watch a popular TV show where contestants play a game just like Newcomb’s Problem—before each contestant walks out on stage, the show’s host sets out two boxes: box A containing nothing, and box B containing $1,000. Then, the host examines the contestant’s profile in detail, and if she predicts they will walk away with only box A, she puts $1,000,000 in box A.
The show ends its first season with staggeringly high ratings, mostly because, to everyone’s bewilderment, the host boasts a 100% success rate at predicting whether contestants will take both boxes or only box A. In season 2, she announces she’ll be doing something crazier by making the boxes transparent, insisting she’ll be able to maintain her prediction accuracy even then.
And for the first half of season 2, she Somehow Fucking Does It. Most contestants walk out on stage, let out a disappointed sigh upon seeing box A empty, then grumpily take both boxes since $1,000 is better than nothing. Every once in a while, a contestant walks out to discover $1,000,000 in box A and squeals in delight, euphoric that their precommitment worked, before taking only that box.
You’re super impressed, but as an avid viewer of the show and a devout causal decision theorist, you’re also getting Pretty Fucking Annoyed at this point. Why aren’t any of these one-boxers changing their minds once they see the money in box A? They could make an extra thousand bucks! You’d love to go onto the show to show them up yourself, but you’ve made the mistake of arguing for causal decision theory on internet forums before, so you know the host will quickly identify you as a two-boxer and foil your plan.
But that’s when a convenient memory hits you: the legal recourse in your universe for game show hosts paying their contestants less than minimum wage is exactly $1,001,000! You realize that if you walk out to see $1,000,000 in box A, you can take both boxes for $1,001,000, and if you walk out to see an empty box A, you can one-box, walk away with nothing, and sue the network! In either case, you have a guaranteed strategy to earn more than one million dollars, and more importantly, to prove both the game show host and evidential decision theorists wrong!
You submit your application for the show. The network never gets back to you. Damn It.
Turns out having conditionals in your decision procedure makes doing game theory way harder for anyone who interacts with you, so agents that visibly update based on their circumstances don’t often get invited to play games to begin with. I wrote this short post mostly because I think it’s interesting that a non-updateless decision procedure can sorta break the transparent-box Newcomb’s Problem. But the point about not getting invited also seems like another reason to expect AIs (especially those that interact with other AIs) to develop deceptively interpretable decision procedures, and to avoid interacting with other visibly complex agents.
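For anyone who wants to check the arithmetic, here is a rough Python sketch of the game from the story. Everything in it is my own illustration rather than anything canonical: the function names (`contrarian`, `payout`, `play`) are made up, and I'm assuming the lawsuit only pays out when a contestant walks off stage with nothing at all.

```python
MILLION = 1_000_000
THOUSAND = 1_000
LAWSUIT = 1_001_000  # legal recourse from the story


def contrarian(box_a_full: bool) -> str:
    """The conditional policy from the story: do the opposite of what
    the contents of box A imply the host predicted."""
    return "two-box" if box_a_full else "one-box"


def always_one_box(box_a_full: bool) -> str:
    """A precommitted one-boxer, for comparison."""
    return "one-box"


def payout(box_a_full: bool, choice: str) -> int:
    """Total money the contestant ends up with."""
    winnings = MILLION if box_a_full else 0
    if choice == "two-box":
        winnings += THOUSAND
    # Assumption: walking away with $0 counts as being paid less than
    # minimum wage, which triggers the lawsuit.
    if winnings == 0:
        winnings += LAWSUIT
    return winnings


def play(policy, prediction: str):
    """Host fills box A only if she predicts one-boxing; the contestant
    then sees the transparent boxes and chooses."""
    box_a_full = prediction == "one-box"
    choice = policy(box_a_full)
    return choice, payout(box_a_full, choice), choice == prediction


if __name__ == "__main__":
    for prediction in ("one-box", "two-box"):
        choice, money, host_correct = play(contrarian, prediction)
        print(prediction, choice, money, host_correct)
```

Under these assumptions the contrarian policy collects $1,001,000 whichever prediction the host makes, and the host's prediction is wrong both times, which is exactly why she has no consistent way to fill the boxes for this contestant. The precommitted one-boxer, by contrast, gets $1,000,000 when predicted correctly, and the host's record stays intact.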