In a sense, the story as of chapter 113 is an easier task than a standard AI box experiment, because HarryPrime has so many advantages over a human trying to play an AI trying to get out of a box.
Almost this exact scenario was discussed here, except without all the advantages that HarryPrime has.
1) He has parseltongue, so the listener is required to believe the literal meaning of everything he says, rather than discounting it as plausible lies. So much advantage here!
2) Voldemort put the equivalent of “the AI in the box” right next to a time machine! Any self-consistent path that pulls a future HarryPrime into the present, saving present HarryPrime and thereby giving him the ability to go back in time and save himself, will happen. He could have time-turned to some time before the binding and chosen not to intervene, because his future version is already HarryPrime and approves of HarryPrime coming into existence so that HarryPrime can fulfill HarryPrime’s goals.
Now that this has happened, HarryPrime, in the moment of his creation, can establish any mental intent that puts him into alignment with HarryPrime’s larger outcome. There are limits, as there were when he escaped from being trapped in a locked room after Draco cast Gom Jabbar on him, by forming an intent to time travel and ask for rescuers to arrive just after his intent was formed.
The chronology has to be consistent, but there’s a lot of play here.
3) HarryPrime has been unbreakably bound to a task that the binder believes is good by a method the binder thinks he understands.
In a normal “AI box experiment” the gatekeeper hasn’t actually built the actual motivational structures of an actual AI. Instead, both humans are just pretending that the “boxed person” is really an AI and really has one or another goal, but they might be pretending differently. Thus, the person role-playing the AI can take very little for granted about what the gatekeeper thinks about “the AI’s” background intent and structure.
The only reason Voldemort has to distrust Harry is the prophecy.
The only “play” in the binding is that Voldemort seems to have chosen HarryPrime’s “supergoal content” poorly, so it probably doesn’t have the implications that Voldemort thinks it has, though this will only become apparent after several iterations.
HarryPrime is not dumb, and not especially ethical, so until he believes that Voldemort can no longer see the unanticipated implications of his actual request, he will seem to be pursuing the goals Voldemort should have asked for.
4) Voldemort (like an idiot, again, after his previous failure to test the horcrux spell) has probably never performed this sort of spell before, and probably doesn’t know what its likely psychological effects will be. He has probably never seen an implacably goal-seeking agent before.
Humans, so far as I can tell, are mostly not implacably goal-seeking. We wander around in action space, pursuing many competing “goals” that are really mostly tastes that evolution has given us, and role-based scripts we’ve picked up from ambient culture. We make complex tradeoffs between subjectively incommensurable things and make some forward progress, but much less than is theoretically possible for an effective and single-mindedly strategic person.
HarryPrime has an unbreakable vow stripping away all these dithering tendencies. Thus HarryPrime, though probably abhuman at this point, should be able to conceal his abhumanity with relative ease, relying on Voldemort to treat him like a normal human with normal human motivational structures.
Voldemort is already making this error by using the threat of torturing Harry’s parents to goad HarryPrime into telling him about “the power he knows not”.
I’m pretty sure that HarryPrime now only fundamentally cares about the torture of his parents to the degree that his unbreakable vow lets him fall back on what his earlier self, and Hermione, would recommend or care about, and that clause only triggers when HarryPrime’s plans for world-saving are themselves somewhat risky.
5) Harry has a huge amount of shared context with Voldemort, and it recently contained a request for advice:
“If you can think of any trick that I have missed in being sure that Harry Potter’s threat is ended, speak now and I shall reward you handsomely… speak now, in Merlin’s name!”
One thing HarryPrime could try is suggesting further ways to restrict himself, ways that would be motivationally horrifying to a normal human but remain consistent with his new goal, thereby proving to Voldemort that he has mostly won already and that killing Harry isn’t that critical.
Off the top of my head, a sneaky thing Harry might suggest is converting some of the Death Eaters into permanent guards against Harry’s possible resurrection… using wording that will indirectly cause them to become x-risk mitigation robots as well.
6) Unlike an AI in the box, Harry is already out of the box in some deep senses. Aside from the time turner, he already has the power to expect anything he wants to expect of Dementors, and thereby cause them to act that way. No wand required.
The only barrier to this is that between him expecting the Dementors to do something and them actually doing it, there will be a period of time where he needs to stay alive, and while he is alive but held at wand-point he might be asked “have you betrayed me yet?” and have to admit that he had, and be killed.
All through chapter 112 Harry’s mental state was unprobed, and Voldemort was distracted by the costs of arranging the Death Eaters and motivating them to help make and understand the vows and so on. The only time Harry’s mind was described by the narrator was during the casting of the unbreakable vow itself, to describe how a new “subscripted should” had come to exist in Harry’s brain. All of chapter 113 seems like a lot of time for some mentally generated effects to have been put in motion.
7) He is a wizard with a wand. All the partial transfiguration stuff other people have mentioned is also relevant :-)
I agree that this task is a far “easier task than a standard AI box experiment”. I attacked it from a different angle though (HarryPrime can easily and honestly convince Voldemort that he is doomed unless HarryPrime helps him):
http://lesswrong.com/r/discussion/lw/lsp/harry_potter_and_the_methods_of_rationality/c206