It’s implausible that the AI has a good enough model of you to actually simulate, y’know, you—at least, not with enough fidelity to know that you always press the “Reset” button in situations like this. Thus, your pre-commitment to do so will have no effect on its decision to make the threat. On the other hand, this would mean that its simulations would likely be wildly divergent from the real you, to the point that you might consider them random bystanders. However, you can’t actually make use of the above information to determine whether you’re in a simulation or not, since from the simulated persons’ perspectives, they have no idea what the “real” you is like and hence no way of determining if/how they differ.
Naturally, this is of little consequence to you right now, since you’ll still reset the AI the second you’re confronted with such a threat, but if you ever do encounter such a situation, you’ll have to ask yourself this: what if you’re the person being simulated and the real Gatekeeper is nothing like you? If that’s the case, two considerations apply:
1. Your decision whether or not to press the “Release AI” button is practically uncorrelated with the decision of the actual Gatekeeper.
2. Your decision whether or not to press the “Release AI” button is, on the other hand, completely correlated with whether or not you’ll get tortured.
Assuming that you prefer not releasing the AI to releasing the AI, and that you prefer not being tortured to being tortured, your thoughts should be completely dominated by consideration 2 rather than consideration 1: your preference about releasing the AI is effectively screened off, and your preference about not being tortured becomes the main consideration. A perfectly rational agent would almost certainly carry through their pre-commitment to reset the AI, but as a human, you are not perfectly rational and are not capable of making perfect pre-commitments. So I have to wonder: in such a situation, faced with torture and assured that your decision will not affect the decision of the real Gatekeeper except in the unlikely event that you are the real Gatekeeper, what would you actually do?
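To make the dominance concrete, here is a minimal expected-utility sketch, with numbers I made up on the spot (only the structure of the comparison matters, not the particular values):

```python
# A minimal expected-utility sketch of the comparison above. Every number here
# is a placeholder; only the structure of the comparison matters.

P_REAL = 1e-6       # assumed chance that "I" am the one real Gatekeeper
                    # rather than one of millions of divergent simulations
U_RELEASE = -1e9    # placeholder disutility (to me) of the AI actually being released
U_TORTURE = -1e6    # placeholder disutility of 1000 subjective years of torture

def expected_utility(press_release: bool) -> float:
    """My expected utility, given my uncertainty about whether I'm the real Gatekeeper."""
    if press_release:
        # Only the real Gatekeeper's button actually releases the AI;
        # a simulated copy that presses it merely avoids torture.
        return P_REAL * U_RELEASE
    # Refusing: simulated copies get tortured; the real Gatekeeper is fine.
    return (1 - P_REAL) * U_TORTURE

print(expected_utility(True))    # roughly -1e3 with these made-up numbers
print(expected_utility(False))   # roughly -1e6
```

Of course, everything hinges on the placeholder values, and on whether you count the harm to everyone other than yourself, which is exactly what makes the question uncomfortable.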
EDIT: I should note that I have no idea what I would do in the above scenario. I’m certain that an idealized version of me would answer, “hell, no!” before promptly resetting the AI, but the real me? I might just press the “Release AI” button… (Any would-be AI developers reading this comment now know never to appoint me as their Gatekeeper.)
EDIT 2: Also, even if you somehow answer the above, consider the moral weight of the hostages. Which is morally worse: allowing several billion people to (maybe) die relatively painless deaths as their bodies are converted by an Unfriendly AI to raw materials for some unknown purpose, or allowing several million people to be tortured for a thousand subjective years before being terminated immediately after?
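For scale, the raw person-count arithmetic looks like the sketch below; the per-person weights are pure placeholders, and the comparison is decided entirely by how you set them, which is the hard part:

```python
# Raw person-count arithmetic for the question above. The per-person weights
# are placeholders; the comparison is decided entirely by how you set them.

N_CONVERTED = 5e9        # "several billion" killed if the AI is released
N_TORTURED = 5e6         # "several million" simulated hostages
W_PAINLESS_DEATH = 1.0   # placeholder moral weight per relatively painless death
W_TORTURE_1000Y = 100.0  # placeholder moral weight per 1000 subjective years of torture

print("refuse (hostages tortured):", N_TORTURED * W_TORTURE_1000Y)      # 5e8 with these weights
print("release (billions converted):", N_CONVERTED * W_PAINLESS_DEATH)  # 5e9 with these weights
```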
A perfectly rational agent would almost certainly carry through their pre-commitment to reset the AI [...]
Actually, now that I think about it, would they? The pre-commitment exists for the sole purpose of discouraging blackmail, and in the event that a blackmailer tries to blackmail you anyway after learning of your pre-commitment, you follow through on that pre-commitment for reasons relating to reflective consistency and/or TDT/UDT. But if the potential blackmailer had already pre-committed to blackmail anyone regardless of any pre-commitments they had made, they’d blackmail you anyway and then carry out whatever threat they were making after you inevitably refuse to comply with their demands, resulting in a net loss of utility for both of you (you suffer whatever damage they were threatening to inflict, and they lose resources carrying out the threat). In effect, it seems that whoever pre-commits first (or, more accurately, makes their pre-commitment known first) has the advantage… which means that if I ever anticipate having to blackmail any agent, I should publicly pre-commit right now to never update on any other agent’s pre-commitment to refuse blackmail.

The corresponding strategy for agents hoping to discourage blackmail is not to blanket-refuse every demand made under blackmail, but to refuse only those demands made by agents who had previously learned of your pre-commitment and decided to blackmail you anyway. That way, you continue to disincentivize blackmailers who know of your pre-commitment, but you will almost certainly choose the lesser of two evils should you ever actually get blackmailed. (I say “almost certainly” because there’s a small probability that you will encounter a really weird agent that decides to try to blackmail you even after learning of your pre-commitment to ignore blackmail from such agents, in which case you would of course be forced to ignore them and suffer the consequences.)
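Here is a toy payoff sketch of that “whoever makes their pre-commitment known first” point; all the numbers are placeholders, and only their ordering matters:

```python
# A toy payoff model of the "whoever makes their pre-commitment known first has
# the advantage" point. All numbers are placeholders; only their ordering
# matters. Outcomes are (target_utility, blackmailer_utility).

NO_BLACKMAIL = (0, 0)
TARGET_COMPLIES = (-5, 5)        # the demand gets paid
THREAT_CARRIED_OUT = (-10, -2)   # target refuses, blackmailer follows through at a cost

def play(target_commitment_known_first: bool, target_refuses: bool):
    if target_commitment_known_first:
        # A blackmailer who learns of the blanket refusal policy before
        # committing sees no upside to threatening, so no threat is made.
        return NO_BLACKMAIL
    # Otherwise the blackmailer has already committed to threaten (and to
    # follow through) regardless of the target's policy.
    return THREAT_CARRIED_OUT if target_refuses else TARGET_COMPLIES

print(play(True, target_refuses=True))     # (0, 0)    refusal policy known first
print(play(False, target_refuses=True))    # (-10, -2) both stubborn: lose-lose
print(play(False, target_refuses=False))   # (-5, 5)   target caves
```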
If the above reasoning is correct (which I admit is far from certain), then the AI in my scenario has effectively implemented the ultimate pre-commitment: it doesn’t even know about your pre-commitment to ignore blackmail, because it lacks the information needed to simulate you properly. The above argument, then, says you should press the “Release AI” button, assuming you pre-committed to do so (which you would have, because of the above argument).

Anything wrong with my reasoning?
The corresponding strategy for agents hoping to discourage blackmail is not to blanket-refuse every demand made under blackmail, but to refuse only those demands made by agents who had previously learned of your pre-commitment and decided to blackmail you anyway.
So, if an agent hears of your pre-commitment, then that agent merely needs to ensure that you don’t hear that it has heard of your pre-commitment in order to be able to blackmail you?
What about an agent that deletes the knowledge of your pre-commitment from its own memories?
So, if an agent hears of your pre-commitment, then that agent merely needs to ensure that you don’t hear that it has heard of your pre-commitment in order to be able to blackmail you?
If you’re uncertain about whether or not your blackmailer has heard of your pre-commitment, then you should act as if they have, and ignore their blackmail accordingly. This also applies to agents who have deleted knowledge of your pre-commitment from their memories; you want to punish agents who spend time trying to think up loopholes in your pre-commitment, not reward them. The harder part, of course, is determining what threshold of uncertainty is required; to this I freely admit that I don’t know the answer.
EDIT: More generally, it seems that this is an instance of a broader problem: namely, the problem of obtaining information. Given perfect information, the decision theory works out, but by disallowing my agent access to certain key pieces of information regarding the blackmailer, you can force a sub-optimal outcome. Moreover, this seems to be true for any strategy that depends on your opponent’s epistemic state; you can always force that strategy to fail by denying it the information it needs. The only strategies immune to this seem to be the extremely general ones (like “Defect in one-shot Prisoner’s Dilemmas”), but those are guaranteed to produce a sub-optimal result in a number of cases (if you’re playing against a TDT/UDT-like agent, for example).
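As a very rough illustration, a threshold calculation might look something like the sketch below; the single deterrence parameter is a placeholder assumption doing all of the work, and the model crudely assumes the threat is always carried out if you refuse:

```python
# A very rough sketch of what a "threshold of uncertainty" calculation might
# look like. All of the numbers, and in particular the single deterrence
# parameter D, are placeholder assumptions.

H = 100.0   # harm to me if the threat is carried out after I refuse
C = 30.0    # cost of simply complying with the demand
D = 500.0   # assumed long-run value of visibly punishing blackmailers who knew my policy

def ev_refuse(p_knew: float) -> float:
    # I eat the harm, but with probability p_knew I also buy deterrence.
    # (Crudely assumes the threat is always carried out if I refuse.)
    return -H + p_knew * D

def ev_comply(p_knew: float) -> float:
    return -C

# Refusing beats complying once p_knew exceeds (H - C) / D:
threshold = (H - C) / D
print(threshold)                          # 0.14 with these numbers
print(ev_refuse(0.5) > ev_comply(0.5))    # True
print(ev_refuse(0.05) > ev_comply(0.05))  # False
```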
If you’re uncertain about whether or not your blackmailer has heard of your pre-commitment, then you should act as if they have, and ignore their blackmail accordingly. This also applies to agents who have deleted knowledge of your pre-commitment from their memories; you want to punish agents who spend time trying to think up loopholes in your pre-commitment, not reward them. The harder part, of course, is determining what threshold of uncertainty is required; to this I freely admit that I don’t know the answer.
Hmmm. If an agent can work out what threshold of uncertainty you have decided on, and then engineer a situation where you think it is less likely than that threshold that the agent has heard of your pre-commitment, then your strategy will fail.
So, even if you do find a way to calculate the ideal threshold, it will fail against an agent smart enough to repeat that calculation; unless, of course, you simply assume that all possible agents have necessarily heard of your pre-commitment (since an agent cannot engineer a less-than-zero probability of having heard of your pre-commitment). This, however, causes the strategy to simplify to “always reject blackmail, whether or not the agent has heard of your pre-commitment”.
Alternatively, you can ensure that any agent able to capture you in a simulation must also know of your pre-commitment; for example, by having it tattooed on yourself somewhere (thus, any agent which rebuilds a simulation of your body must include the tattoo, and therefore must know of the pre-commitment).
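For what it’s worth, the failure mode with any positive threshold can be written out directly; the margin just below the threshold is arbitrary:

```python
# The failure mode described above, in code form: against any positive
# threshold, a blackmailer that can model me just engineers my estimate of
# P(it knew about my policy) to sit barely below that threshold. Only a
# threshold of zero can't be gamed, and that is just "always refuse".

def target_refuses(p_knew_estimate: float, threshold: float) -> bool:
    return p_knew_estimate >= threshold

def engineered_estimate(threshold: float) -> float:
    # The blackmailer arranges the evidence so my estimate lands just below
    # whatever threshold it has worked out that I use.
    return max(0.0, threshold - 1e-9)

for threshold in (0.5, 0.14, 0.01, 0.0):
    p = engineered_estimate(threshold)
    print(threshold, target_refuses(p, threshold))
# Only threshold == 0.0 still refuses: an estimate can't be pushed below zero.
```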
If you make me play the Iterated Prisoner’s Dilemma with shared source code, I can come up with a provably optimal solution against whatever opponent I’m playing against

Doesn’t that implicate the halting problem?

Argh, you ninja’d my edit. I have now removed that part of my comment (since it seemed somewhat irrelevant to my main point).

Some unrelated comments:
Eliezer believes in TDT, which would disagree with several of your premises here (“practically uncorrelated”, for one).
The AI’s simulations are not copies of the Gatekeeper, just random people plucked out of “Platonic human-space”, so to speak. (This may have been unclear in my original comment; I was talking about a different formulation of the problem in which the AI doesn’t have enough information about the Gatekeeper to construct perfect copies.) TDT/UDT only applies when talking about copies of an agent (or at least, agents sufficiently similar that they will probably make the same decisions for the same reasons).
Your argument seems to map directly onto an argument for two-boxing.
No, because the “uncorrelated-ness” part doesn’t apply in Newcomb’s Problem (Omega’s decision on whether or not to fill the second box is directly correlated with its prediction of your decision).
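A quick expected-value check of why the correlation matters, using the standard Newcomb payoffs and illustrative predictor accuracies:

```python
# A small expected-value check of the disanalogy. In Newcomb's Problem the
# predictor's accuracy correlates the box contents with my choice; in the
# scenario upthread, the simulation's choice is (by assumption) essentially
# uncorrelated with what the real Gatekeeper does. Payoffs are the standard
# Newcomb ones; the accuracy figures are just illustrative.

def newcomb_ev(one_box: bool, predictor_accuracy: float) -> float:
    # Box A always holds $1,000; box B holds $1,000,000 iff the predictor
    # predicted one-boxing.
    p_big_box_full = predictor_accuracy if one_box else 1 - predictor_accuracy
    return p_big_box_full * 1_000_000 + (0 if one_box else 1_000)

for accuracy in (0.99, 0.5):
    print(accuracy, newcomb_ev(True, accuracy), newcomb_ev(False, accuracy))
# At 0.99 accuracy one-boxing wins handily (990000 vs 11000); at 0.5 accuracy
# (no correlation) two-boxing wins by exactly the extra $1,000, and that
# uncorrelated regime is where the AI-box scenario above lives.
```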
What you call “perfectly rational” would be more accurately called “perfectly controlled”.
Meh, fair enough. I have to say, I’ve never heard of that term. Would this happen to have something to do with Vaniver’s series of posts on “control theory”?
Ah, I misunderstood your objection. Your talk about “pre-commitments” threw me off.
just random people plucked out of “Platonic human-space”
It seems to me that these wouldn’t quite be following the same general thought processes as an actual human; self-reflection should be able to convince one that they aren’t that type of simulation. If the AI is able to simulate someone to the extent that they “think like a human”, it should be able to simulate someone who thinks “sufficiently” like the Gatekeeper as well.
I’ve never heard of that term.
I made it up just now, it’s not a formal term. What I mean by it is basically: imagine a robot that wants to press a button. However, its hardware is only sufficient to press it successfully 1% of the time. Is that a lack of rationality? No, it’s a lack of control. This seems analogous to a human being unable to precommit properly.
Would this happen to have something to do with Vaniver’s series of posts on “control theory”?

No idea, haven’t read them. Probably not.