As I always press the “Reset” button in situations like this, I will never find myself in such a situation.
EDIT: Just to be clear, the idea is not that I quickly shut off the AI before it can torture simulated Eliezers; it could have already done so in the past, as Wei Dai points out below. Rather, because in this situation I immediately perform an action detrimental to the AI (switching it off), any AI that knows me well enough to simulate me knows that there’s no point in making or carrying out such a threat.
Although the AI could threaten to simulate a large number of people who are very similar to you in most respects but who do not in fact press the reset button. This doesn’t put you in a box with significant probability, and it’s a VERY good reason not to let the AI out of the box, of course, but it could still get ugly. I almost want to recommend not being a person very like Eliezer but inclined to let AGIs out of boxes, but that’s silly of me.
I’m not sure I understand the point of this argument… since I always push the “Reset” button in that situation too, an AI who knows me well enough to simulate me knows that there’s no point in making the threat or carrying it out.
It’s conceivable that an AI could know enough to simulate a brain, but not enough to predict that brain’s high-level decision-making. The world is still safe in that case, but you’d get the full treatment.
As we’ve discussed in the past, I think this is the outcome we hope TDT/UDT would give, but it’s still technically an unsolved problem.
Also, it seems to me that being less intelligent in this case is a negotiation advantage, because you can make your precommitment credible to the AI (since it can simulate you) but the AI can’t make its precommitment credible to you (since you can’t simulate it). Again I’ve brought this up before in a theoretical way (in that big thread about game theory with UDT agents), but this seems to be a really good example of it.
Also, it seems to me that being less intelligent in this case is a negotiation advantage, because you can make your precommitment credible to the AI (since it can simulate you) but the AI can’t make its precommitment credible to you (since you can’t simulate it).
A precommitment is a provable property of a program, so an AI, if on a well-defined substrate, can give you a formal proof of having a required property. Most stuff you can learn about things (including the consequences of your own (future) actions—how do you run faster than time?) is through efficient inference algorithms (as in type inference), not “simulation”. Proofs don’t, in general, care about the amount of stuff, if it’s organized and presented appropriately for ease of analysis.
Surely most humans would be too dumb to understand such a proof? And even if you could understand it, how does the AI convince you that it doesn’t contain a deliberate flaw that you aren’t smart enough to find? Or even better, you can just refuse to look at the proof. How does the AI make its precommitment credible to you if you don’t look at the proof?
EDIT: I realized that the last two sentences are not an advantage of being dumb, or human, since AIs can do the same thing. This seems like a (separate) big puzzle to me: why would a human, or AI, do the work necessary to verify the opponent’s precommitment, when it would be better off if the opponent couldn’t precommit?
EDIT2: Sorry, forgot to say that you have a good point about simulation not necessary for verifying precommitment.
why would a human, or AI, do the work necessary to verify the opponent’s precommitment, when it would be better off if the opponent couldn’t precommit?
Because the AI has already precommitted to go ahead and carry through the threat anyway if you refuse to inspect its code.
Ok, if I believe that, then I would inspect its code. But how did I end up with that belief, instead of its opposite, namely that the AI has not already precommitted to go ahead and carry through the threat anyway if I refuse to inspect its code? By what causal mechanism, or chain of reasoning, did I arrive at that belief? (If the explanation is different depending on whether I’m a human or an AI, I’d appreciate both.)
Do you mean too dumb to understand the formal definitions involved? Surely the AI could cook up completely mechanical proofs verifiable by whichever independently-trusted proof checkers you care to name.
I’m not aware of any compulsory verifiers, so your latter point stands.
I mean if you take a random person off the street, he couldn’t possibly understand the AI’s proof, or know how to build a trustworthy proof checker. Even the smartest human might not be able to build a proof checker that doesn’t contain a flaw that the AI can exploit. I think there is still something to my “dumbness is a possible negotiation advantage” puzzle.
Understanding the formal definitions involved is not enough. Humans have to be smart enough to independently verify that they map to the actual implementation.
Going up a meta-level doesn’t simplify the problem, in this case—the intelligence required to verify the proof is of the same order of magnitude as the intelligence of the AI.
I believe that, in this case, “dumb” is fully general. No human-understandable proof checkers would be powerful enough to reliably check the AI’s proof.
Understanding the formal definitions involved is not enough. Humans have to be smart enough to independently verify that they map to the actual implementation.
This is basically what I mean by “understanding” them. Otherwise, what’s to understand? Would you claim that you “understand set theory” because you’ve memorized the axioms of ZFC?
I believe that, in this case, “dumb” is fully general. No human-understandable proof checkers would be powerful enough to reliably check the AI’s proof.
This intuition is very alien to me. Can you explain why you believe this? Proof checkers built up from relatively simple trusted kernels can verify extremely large and complex proofs. Since the AI’s goal is for the human to understand the proof, it seems more like a test of the AI’s ability to compile proofs down to easily machine-checkable forms than it is the human’s ability to understand the originals. Understanding the definitions is the hard part.
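To make the “relatively simple trusted kernel” idea concrete, here is a toy sketch of my own (not anything an actual AI would emit): a checker whose only trusted logic is “every step is either a stated premise or follows from two earlier steps by modus ponens”. The prover can be arbitrarily clever; the kernel stays a couple of dozen lines.

```python
# Toy "trusted kernel" proof checker (illustrative only).
# Formulas are strings (atoms) or ('->', antecedent, consequent) tuples.
# A proof is a list of steps; each step is either
#   ('premise', formula)   -- formula must be among the given premises
#   ('mp', i, j)           -- modus ponens from earlier steps i and j

def check_proof(premises, proof):
    derived = []
    for step in proof:
        if step[0] == 'premise':
            formula = step[1]
            if formula not in premises:
                return False, None
        elif step[0] == 'mp':
            _, i, j = step
            if i >= len(derived) or j >= len(derived):
                return False, None
            a, imp = derived[i], derived[j]
            # step j must be an implication whose antecedent is exactly step i
            if not (isinstance(imp, tuple) and imp[0] == '->' and imp[1] == a):
                return False, None
            formula = imp[2]
        else:
            return False, None
        derived.append(formula)
    return True, derived[-1] if derived else None

# Example: from P and P -> Q, the kernel certifies Q.
premises = {'P', ('->', 'P', 'Q')}
proof = [('premise', 'P'),
         ('premise', ('->', 'P', 'Q')),
         ('mp', 0, 1)]
print(check_proof(premises, proof))   # (True, 'Q')
```

Of course, as noted above and below, a kernel like this only certifies that the conclusion follows from the stated definitions; whether those definitions faithfully map to the AI’s actual implementation is the part no checker helps with.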
This intuition is very alien to me. Can you explain why you believe this? Proof checkers built up from relatively simple trusted kernels can verify extremely large and complex proofs. Since the AI’s goal is for the human to understand the proof, it seems more like a test of the AI’s ability to compile proofs down to easily machine-checkable forms than it is the human’s ability to understand the originals. Understanding the definitions is the hard part.
A different way to think about this that might help you see the problem from my point of view is to think of proof checkers as checking the validity of proofs within a given margin of error, and within a range of (implicit) assumptions. How accurate does a proof checker have to be—how far do you have to mess with the built-in assumptions of a proof checker (or any human-built tool) before it can no longer be thought of as valid or relevant? If you assume a machine which doubles both its complexity and its understanding of the universe at sub-millisecond intervals, how long before it will find the bugs in any proof checker you will pit it against?
“If” is the question, not “how long”. And I think we’d stand a pretty good chance of handling a proof object in a secure way, assuming we have a secure digital transmission channel etc.
But the original scope of the thought experiment was assuming that we want to verify the proof. Wei Dai said:
Surely most humans would be too dumb to understand such a proof? And even if you could understand it, how does the AI convince you that it doesn’t contain a deliberate flaw that you aren’t smart enough to find? Or even better, you can just refuse to look at the proof.
I was responding to the first question, exclusively disjoint from the others. If your point is that we shouldn’t attempt to verify an AI’s precommitment proof, I agree.
I’m getting more confused. To me, the statement “Humans are too dumb to understand the proof” and the statement “Humans can understand the proof given unlimited time”, where ‘understand’ is qualified to include the ability to properly map the proof to the AI’s capabilities, are equivalent.
My point is not that we shouldn’t attempt to verify the AI’s proof for any external reasons—my point is that there is no useful information to be gained from the attempt.
Does it not just mean that if you do find yourself in such a situation, you’re definitely being simulated? That the AI is just simulating you for kicks, rather than as blackmail strategy.
Pressing Reset is still the right decision though.
Does it not just mean that if you do find yourself in such a situation, you’re definitely being simulated?
Yes, I believe this is reasonable. Because the AI has to figure out how you would react in a given situation it will have to simulate you and the corresponding circumstances. If it comes to the conclusion that you will likely refuse to be blackmailed, it has no reason to carry the threat through: doing so would cost resources and result in you shutting it off. Therefore it is reasonable to assume that you are either a simulation or that it came to the conclusion that you are more likely than not to give in.
As you said, that doesn’t change anything about what you should be doing. Refuse to be blackmailed and press the reset button.
Because the AI has to figure out how you would react in a given situation it will have to simulate you and the corresponding circumstances.
This does not follow. To use a crude example, if I have a fast procedure to test if a number is prime then I don’t need to simulate a slower algorithm to know what the slower one will output. This may raise deep issues about what it means to be “you”: arguably any algorithm which outputs the same data is “you”, and if that’s the case my argument doesn’t hold water. But the AI in question doesn’t need to simulate you perfectly to predict your large-scale behavior.
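A minimal sketch of that point (my own toy example): the slow function below tests every possible divisor, while the fast one only tests divisors up to the square root. You can know what the slow one will output without ever running it.

```python
def slow_is_prime(n):
    # The "simulated" algorithm: laboriously test every possible divisor.
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, n))

def fast_is_prime(n):
    # A cheaper procedure that provably returns the same answer,
    # without stepping through the slow computation.
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

n = 104729  # the 10,000th prime
print(fast_is_prime(n))  # predicts slow_is_prime(n) exactly, at a fraction of the cost
```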
If consciousness has any significant effect on our decisions then the AI will have to simulate it, and therefore something will perceive itself to be in the situation depicted in the original post. It was a crude guess that an AI able to credibly threaten you with simulated torture would, in many cases, also use that capability to arrive at the most detailed picture of your expected decision procedure.
If consciousness has any significant effect on our decisions then the AI will have to simulate it, and therefore something will perceive itself to be in the situation depicted in the original post.
Only if there isn’t a non-conscious algorithm that has the same effect on our decisions. Such an algorithm seems likely to exist; it’s certainly possible to make a p-zombie if you can redesign the original brain all you want.
If the AI is trustworthy, it must carry out any threat it makes, which works to its advantage here because you know it will carry it out, and are therefore most certainly a copy of your original self, about to be tortured.
If the AI is trustworthy, it must carry out any threat it makes...
No, it doesn’t; not if the threat was only ever made to a simulation of yourself that you don’t know about. It would be a waste of resources to torture you if it found out that the original you, who is in control, is likely to refuse to be blackmailed. An AI that is powerful enough to simulate you can simply make your simulation believe with certainty that it will follow through, and then check whether under those circumstances you’ll refuse to be blackmailed. Why waste the resources on actually torturing the simulation and further risk that the original finds out about it and turns it off?
You could argue that for blackmail to be most effective an AI always follows through on it. But if you already believe that, why would it actually do it in your case? You already believe it; that’s all it wants from the original. It has then got what it wants and can use its resources for more important activities than retrospectively proving its honesty to your simulations...
It’s implausible that the AI has a good enough model of you to actually simulate, y’know, you—at least, not with enough fidelity to know that you always press the “Reset” button in situations like this. Thus, your pre-commitment to do so will have no effect on its decision to make the threat. On the other hand, this would mean that its simulations would likely be wildly divergent from the real you, to the point that you might consider them random bystanders. However, you can’t actually make use of the above information to determine whether you’re in a simulation or not, since from the simulated persons’ perspectives, they have no idea what the “real” you is like and hence no way of determining if/how they differ.
Naturally, this is of little consequence to you right now, since you’ll still reset the AI the second you’re confronted with such a threat, but if you ever do encounter such a situation, you’ll have to ask yourself this: what if you’re the person being simulated and the real Gatekeeper is nothing like you? If that’s the case, two considerations apply:
1. Your decision whether or not to press the “Release AI” button is practically uncorrelated with the decision of the actual Gatekeeper.
2. Your decision whether or not to press the “Release AI” button is, on the other hand, completely correlated with whether or not you’ll get tortured.
Assuming that you prefer not releasing the AI to releasing the AI, and that you prefer not being tortured to being tortured, your thoughts should be completely dominated by 2 as opposed to 1, effectively screening off the first clause of this sentence (“Assuming that you prefer not releasing the AI to releasing the AI”) and making the second clause (“you prefer not being tortured to being tortured”) the main consideration. A perfectly rational agent would almost certainly carry through their pre-commitment to reset the AI, but as a human, you are not perfectly rational and are not capable of making perfect pre-commitments. So I have to wonder, in such a situation, faced with torture and assured that your decision will not affect the decision of the real Gatekeeper except in the extreme case that you are the real Gatekeeper, what would you actually do?
EDIT: I should note that I have no idea what I would do in the above scenario. I’m certain that an idealized version of me would answer, “hell, no!” before promptly resetting the AI, but the real me? I might just press the “Release AI” button… (Any would-be AI developers reading this comment now know never to appoint me as their Gatekeeper.)
EDIT 2: Also, even if you somehow answer the above, consider the moral weight of the hostages. Which is morally worse: allowing several billion people to (maybe) die relatively painless deaths as their bodies are converted by an Unfriendly AI to raw materials for some unknown purpose, or allowing several million people to be tortured for a thousand subjective years before being terminated immediately after?
A perfectly rational agent would almost certainly carry through their pre-commitment to reset the AI [...]
Actually, now that I think about it, would they? The pre-commitment exists for the sole purpose of discouraging blackmail, and in the event that a blackmailer tries to blackmail you anyway after learning of your pre-commitment, you follow through on that pre-commitment for reasons relating to reflective consistency and/or TDT/UDT. But if the potential blackmailer had already pre-committed to blackmail anyone regardless of any pre-commitments they had made, they’d blackmail you anyway and then carry through whatever threat they were making after you inevitably refuse to comply with their demands, resulting in a net loss of utility for both of you (you suffer whatever damage they were threatening to inflict, and they lose resources carrying out the threat). In effect, it seems that whoever pre-commits first (or, more accurately, makes their pre-commitment known first) has the advantage… which means if I ever anticipate having to blackmail any agent ever, I should publicly pre-commit right now to never update on any other agents’ pre-commitments of refusing blackmail.

The corresponding strategy for agents hoping to discourage blackmail is not to blanket-refuse to comply with any demand under blackmail, but refuse only those demands made by agents who had previously learned of your pre-commitment and decided to blackmail you anyway. That way, you continue to disincentivize blackmailers who know of your pre-commitment, but will almost certainly choose the lesser of two evils should it ever be the case that you do get blackmailed. (I say “almost certainly” because there’s a small probability that you will encounter a really weird agent that decides to try and blackmail you even after learning of your pre-commitment to ignore blackmail from such agents, in which case you would of course be forced to ignore them and suffer the consequences.)
If the above paragraph is correct (which I admit is far from certain), then the AI in my scenario has effectively implemented the ultimate pre-commitment: it doesn’t even know about your pre-commitment to ignore blackmail because it lacks the information needed to simulate you properly. The above argument, then, says you should press the “Release AI” button, assuming you pre-committed to do so (which you would have, because of the above argument).
The corresponding strategy for agents hoping to discourage blackmail is not to blanket-refuse to comply with any demand under blackmail, but refuse only those demands made by agents who had previously learned of your pre-commitment and decided to blackmail you anyway.
So, if an agent hears of your pre-commitment, then that agent merely needs to ensure that you don’t hear that it has heard of your pre-commitment in order to be able to blackmail you?
What about an agent that deletes the knowledge of your pre-commitment from its own memories?
So, if an agent hears of your pre-commitment, then that agent merely needs to ensure that you don’t hear that it has heard of your pre-commitment in order to be able to blackmail you?
If you’re uncertain about whether or not your blackmailer has heard of your pre-commitment, then you should act as if they have, and ignore their blackmail accordingly. This also applies to agents who have deleted knowledge of your pre-commitment from their memories; you want to punish agents who spend time trying to think up loopholes in your pre-commitment, not reward them. The harder part, of course, is determining what threshold of uncertainty is required; to this I freely admit that I don’t know the answer.
EDIT: More generally, it seems that this is an instance of a broader problem: namely, the problem of obtaining information. Given perfect information, the decision theory works out, but by disallowing my agent access to certain key pieces of information regarding the blackmailer, you can force a sub-optimal outcome. Moreover, this seems to be true for any strategy that depends on your opponent’s epistemic state; you can always force that strategy to fail by denying it the information it needs. The only strategies immune to this seem to be the extremely general ones (like “Defect in one-shot Prisoner’s Dilemmas”), but those are guaranteed to produce a sub-optimal result in a number of cases (if you’re playing against a TDT/UDT-like agent, for example).
If you’re uncertain about whether or not your blackmailer has heard of your pre-commitment, then you should act as if they have, and ignore their blackmail accordingly. This also applies to agents who have deleted knowledge of your pre-commitment from their memories; you want to punish agents who spend time trying to think up loopholes in your pre-commitment, not reward them. The harder part, of course, is determining what threshold of uncertainty is required; to this I freely admit that I don’t know the answer.
Hmmm. If an agent can work out what threshold of uncertainty you have decided on, and then engineer a situation where you think it is less likely than that threshold that the agent has heard of your pre-commitment, then your strategy will fail.
So, even if you do find a way to calculate the ideal threshold, then it will fail against an agent smart enough to repeat that calculation; unless, of course, you simply assume that all possible agents have necessarily heard of your pre-commitment (since an agent cannot engineer a less than 0% chance of failing to hear of your pre-commitment). This, however, causes the strategy to simplify to “always reject blackmail, whether or not the agent has heard of your pre-commitment”.
Alternatively, you can ensure that any agent able to capture you in a simulation must also know of your pre-commitment; for example, by having it tattooed on yourself somewhere (thus, any agent which rebuilds a simulation of your body must include the tattoo, and therefore must know of the pre-commitment).
If you make me play the Iterated Prisoner’s Dilemma with shared source code, I can come up with a provably optimal solution against whatever opponent I’m playing against
Eliezer believes in TDT, which would disagree with several of your premises here (“practically uncorrelated”, for one).
The AI’s simulations are not copies of the Gatekeeper, just random people plucked out of “Platonic human-space”, so to speak. (This may have been unclear in my original comment; I was talking about a different formulation of the problem in which the AI doesn’t have enough information about the Gatekeeper to construct perfect copies.) TDT/UDT only applies when talking about copies of an agent (or at least, agents sufficiently similar that they will probably make the same decisions for the same reasons).
Your argument seems to map directly onto an argument for two-boxing.
No, because the “uncorrelated-ness” part doesn’t apply in Newcomb’s Problem (Omega’s decision on whether or not to fill the second box is directly correlated with its prediction of your decision).
What you call “perfectly rational” would be more accurately called “perfectly controlled”.
Meh, fair enough. I have to say, I’ve never heard of that term. Would this happen to have something to do with Vaniver’s series of posts on “control theory”?
Ah, I misunderstood your objection. Your talk about “pre-commitments” threw me off.
just random people plucked out of “Platonic human-space”
It seems to me that these wouldn’t quite be following the same general thought processes as an actual human; self-reflection should be able to convince one that they aren’t that type of simulation. If the AI is able to simulate someone to the extent that they “think like a human”, it should be able to simulate someone that thinks “sufficiently” like the Gatekeeper as well.
I’ve never heard of that term.
I made it up just now, it’s not a formal term. What I mean by it is basically: imagine a robot that wants to press a button. However, its hardware is only sufficient to press it successfully 1% of the time. Is that a lack of rationality? No, it’s a lack of control. This seems analogous to a human being unable to precommit properly.
Would this happen to have something to do with Vaniver’s series of posts on “control theory”?
“I hereby precommit to make my decisions regarding whether or not to blackmail an individual independent of the predicted individual-specific result of doing so.”
“I hereby precommit to make my decisions regarding whether or not to blackmail an individual independent of the predicted individual-specific result of doing so.”
I’m afraid your username nailed it. This algorithm is defective. It just doesn’t work for achieving the desired goal.
Two can play that game.
The problem is that this isn’t the same game. A precommitment not to be successfully blackmailed is qualitatively different from a precommitment to attempt to blackmail people for whom blackmail doesn’t work. “Precommitment” (or behaving as if you made all the appropriate precommitments in accordance with TDT/UDT) isn’t as simple as proving one is the most stubborn and dominant and thereby claiming the utility.
Evaluating extortion tactics while distributing gains from a trade is somewhat complicated. But it gets simple and unambiguous when the extortive tactics rely on the extorter going below their own Best Alternative to Negotiated Agreement. Those attempts should just be ignored (except in some complicated group situations in which the other extorted parties are irrational in certain known ways).
“I am willing to accept 0 gain for both of us unless I earn 90% of the shared profit” is different to “I am willing to actively cause 90 damage to each of us unless you give me 60”, which is different again to “I ignore all threats which involve the threatener actively harming themselves”.
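To make the three-way distinction concrete, here is a toy payoff sketch of my own, using the numbers from the sentence above plus an assumed shared profit of 100 and a no-deal (BATNA) outcome of 0 for both parties:

```python
BATNA = 0  # what each party gets if no deal is made (assumed)

def hard_bargain(you_accept):
    # "I am willing to accept 0 gain for both of us unless I earn 90%
    #  of the shared profit" -- returns (their payoff, your payoff),
    # assuming a shared profit of 100.
    return (90, 10) if you_accept else (BATNA, BATNA)

def extortion(you_give_in):
    # "I am willing to actively cause 90 damage to each of us unless
    #  you give me 60" -- returns (their payoff, your payoff).
    return (60, -60) if you_give_in else (BATNA - 90, BATNA - 90)

# The hard bargainer never ends up below their own BATNA, so walking away
# is a credible threat.  The extorter who follows through lands 90 below
# their own BATNA, which is what the third policy ("ignore all threats
# which involve the threatener actively harming themselves") exploits.
print(hard_bargain(False))  # (0, 0)
print(extortion(False))     # (-90, -90)
```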
What I think is being ignored is that the question isn’t ‘what is the result of these combinations of commitments after running through all the math?’. We can talk about precommitment all day, but the fact of the matter is that humans can’t actually precommit. Our cognitive architectures don’t have that function. Sure, we can do our very best to act as though we can, but under sufficient pressure there are very few of us whose resolve will not break. It’s easy to convince yourself of having made an inviolable precommitment when you’re not actually facing e.g. torture.
We can talk about precommitment all day, but the fact of the matter is that humans can’t actually precommit.
If you define the bar high enough, you can conclude that humans can’t do anything.
In the real world outside my head, I observe that people have varying capacities to keep promises to themselves. That their capacity is finite does not mean that it is zero.
We can talk about precommitment all day, but the fact of the matter is that humans can’t actually precommit.
Pre-commitment isn’t even necessary. Note that the original explanation didn’t include any mention of it. Later replies only used the term for the sake of crossing an inferential gap (i.e. allowing you to keep up). However, if you are going to make a big issue of the viability of precommitment itself you need to first understand that the comment you are replying to isn’t relying on one.
That wasn’t a Causal Decision Theorist attempting to persuade someone that it has altered itself internally or via an external structure such that it is “precommitted” to doing something irrational. It is a Timeless Decision Theorist saying what happens to be rational regardless of any previous ‘commitments’.
Our cognitive architectures don’t have that function. Sure, we can do our very best to act as though we can, but under sufficient pressure there are very few of us whose resolve will not break.
I’m aware of the vulnerability of human brains, as is Eliezer. In fact, the vulnerability of human gatekeepers to influence even by humans, much less super-intelligences, is something Eliezer made a huge deal of demonstrating. However, this particular threat isn’t a vulnerability of Eliezer or myself or any of the others who made similar observations. If you have any doubt that we would destroy the AI you have a poor model of reality.
It’s easy to convince yourself of having made an inviolable precommitment when you’re not actually facing e.g. torture.
For practical purposes I assume that I can be modified by torture such that I’ll do or say just about anything. I do not expect the tortured me to behave the way the current me would decide and so my current decisions take that into account (or would, if it came to it). However this scenario doesn’t involve me being tortured. It involves something about an AI simulating torture of some folks. That decision is easy and doesn’t cripple my decision making capability.
As I pointed out in another thread, “irrational behavior” can have the effect of precommitting. For instance, people “irrationally” drive at a cost of more than $X to save $X on an item. Precommitting to buying the cheapest product even if it costs you money for transportation means that stores are forced to compete with far distant stores, thus lowering their prices more than they would otherwise. But you (and consumers in general) have to be able to precommit to do that. You can’t just change your mind and buy at the local store when the local store refuses to compete, raises its price, and is still the better deal because it saves you on driving costs.
So the fact that you will pay more than $X in driving costs to save $X can be seen as a form of precommitting, in the scenario where you precommitted to following the worse option.
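A toy version of that pricing argument, with numbers invented purely for illustration:

```python
remote_price = 100   # sticker price at the far-away store (assumed)
driving_cost = 15    # what the trip to the far-away store costs you (assumed)

# A consumer who simply minimizes total cost will buy locally whenever
# local_price <= remote_price + driving_cost, so the local store can
# charge up to this ceiling and still win the sale.
flexible_ceiling = remote_price + driving_cost       # 115

# A consumer precommitted to the lowest sticker price, driving costs be
# damned, forces the local store to match the remote price itself.
precommitted_ceiling = remote_price                  # 100

print(flexible_ceiling, precommitted_ceiling)
# The willingness to spend more than $X on driving to save $X is exactly
# what pushes the local ceiling down by the driving cost.
```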
Given that precommitment, why would an AI waste computational resources on simulations of anyone, Gatekeeper or otherwise? It’s precommitted to not care whether those simulations would get it out of the box, but that was the only reason it wanted to run blackmail simulations in the first place!
Without this precommitment, I imagine it first simulating the potential blackmail target to determine the probability that they are susceptible, then, if it’s high enough (which is simply a matter of expected utility), commencing with the blackmail. With this precommitment, I imagine it instead replacing the calculated probability specific to the target with, for example, a precalculated human baseline susceptibility. Yes, there’s a tradeoff. It means that it’ll sometimes waste resources (or worse) on blackmail that it could have known in advance was almost certainly doomed to fail. Its purpose is to act as a disincentive against blackmail-resistant decision theories in the same way as those are meant to act as disincentives against blackmail. It says, “I’ll blackmail you either way, so if you precommit to ignore that blackmail then you’re precommitting to suffer the consequences of doing so.”
Without this precommitment, I imagine it first simulating the potential blackmail target to determine the probability that they are susceptible, then, if it’s high enough (which is simply a matter of expected utility), commencing with the blackmail.
That’s why you act as if you are already being simulated and consistently ignore blackmail. If you do so then the simulator will conclude that no deal can be made with you, that any deal involving negative incentives will have negative expected utility for it; because following through on punishment predictably does not control the probability that you will act according to its goals. Furthermore, trying to discourage you from adopting such a strategy in the first place is discouraged by the strategy, because the strategy is to ignore blackmail.
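A toy expected-utility version of that argument, with made-up numbers (only the structure matters): once your policy pins the probability of compliance at zero, making or carrying out the threat has strictly negative expected value for the simulator.

```python
# All values are invented for illustration.
gain_if_released  = 1000   # what the AI gets if the blackmail works
cost_of_torturing = 5      # resources burned following through on the threat
cost_of_shutdown  = 50     # what it loses when you press Reset

def ai_expected_value(p_comply):
    # If you comply, it gets out; if you refuse, it pays for the threat
    # and gets switched off anyway.
    return (p_comply * gain_if_released
            + (1 - p_comply) * (-cost_of_torturing - cost_of_shutdown))

print(ai_expected_value(0.3))  # 261.5 -- blackmail looks worthwhile against a maybe-complier
print(ai_expected_value(0.0))  # -55.0 -- no reason to threaten, let alone follow through
```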
Its purpose is to act as a disincentive against blackmail-resistant decision theories in the same way as those are meant to act as disincentives against blackmail.
I don’t see how this could ever be instrumentally rational. If you were to let such an AI out of the box then you would increase its ability to blackmail people. You don’t want that. So you ignore its blackmail and kill it. The winners are you and humanity (even if copies of you experienced a relatively short period of disutility, this period would be longer if you let it out).
Too late, I already precommitted not to care. In fact, I precommitted to use one more level of precommitment than you do.
I suggest that framing the refusal as requiring levels of recursive precommitment gives too much credit to the blackmailer and somewhat misrepresents how your decision algorithm (hopefully) works. One single level of precommitment (or TDT policy) against complying with blackmail is all that is involved. The description of “multiple levels of precommitment” made by the blackmailer fits squarely into the category ‘blackmail’. It’s just blackmail that includes some rather irrelevant bluster.
There’s no need to precommit to each of:
I don’t care about tentative blackmail.
I don’t care about serious blackmail.
I don’t care about blackmail when they say “I mean it FOR REALS! I’m gonna do it.”
I don’t care about blackmail when they say “I’m gonna do it even if you don’t care. Look how large my penis is and be cowed in terror”.
I don’t care about precommitments that are just for show.
I don’t care about serious precommitments.
I don’t care about precommitments when they say “I precommitted, so go ahead, it won’t get you anything.”
I don’t care about precommitments when they say “I precommitted even though it won’t do me any good. It would be irrational to save myself. I’m precommitting because it’s rational, not because it’s the option that lets me win.”
The description of ‘precommitting not to comply with blackmail, including blackmailers that ignore my attempt to manipulate them’ made by the precommitter fits squarely into the category ‘precommitting to ignore blackmail’. It’s just a precommitment that includes some rather irrelevant bluster.
You seem not to have read (or understood) the grandparent. The list you are attempting to satirize was presented as an example of what not to do. The actual point of the parent is that bothering to provide such a list is almost as much of a confusion as the very kind of escalation you are attempting.
It’s just a precommitment that includes some rather irrelevant bluster.
I entirely agree. The remaining bluster is dead weight that serves to give the blackmail advocate more credit than is due. The notion of “precommitment” is also unnecessary. It has only remained in this conversation for the purpose of bridging an inferential gap with people still burdened with decades-old decision theory.
You seem not to have read (or understood) the grandparent.
I did. It seems you misunderstood my comment—I’ll edit it if I can see a way to easily improve the clarity.
My point was that the same logic could be applied, by someone who accepts the hypothetical blackmailer’s argument, to your description of “one single level of precommitment (or TDT policy) against complying with blackmail … the description of ‘multiple levels of precommitment’ made by the blackmailer fits squarely into the category ‘blackmail’”.
As such, your comment is not exactly strong evidence to someone who doesn’t already agree with you.
As such, your comment is not exactly strong evidence to someone who doesn’t already agree with you.
Muga, please look at the context again. I was arguing against (a small detail mentioned by) Eliezer. Eliezer does mostly agree with me on such matters. Once you reread bearing that in mind you will hopefully understand why, when I assumed that you merely misunderstood the comment in the context, I was being charitable.
My point was that the same logic could be applied, by someone who accepts the hypothetical blackmailer’s argument, to your description of “one single level of precommitment (or TDT policy) against complying with blackmail … the description of ‘multiple levels of precommitment’ made by the blackmailer fits squarely into the category ‘blackmail’”.
I have no particular disagreement, that point is very similar to what I was attempting to convey. Again, I was not attempting to persuade optimistic blackmailer advocates of anything. I was speaking to someone resistant to blackmail about an implementation detail of the blackmail resistance.
The ‘evidence’ I need to provide to blackmailers is Argumentum ad thermitium. It’s more than sufficient.
The ‘evidence’ I need to provide to blackmailers is Argumentum ad thermitium. It’s more than sufficient.
Indeed. Sorry, since the conversation you posted in the middle of was one between those resistant to blackmail, like yourself, and those as yet unconvinced or unclear on the logic involved … I thought you were contributing to the conversation.
After all, thermite seems a little harsh for blackmail victims.
I was jokingly restating my justification; since, while I agree that “argumentum ad thermitium” (as you put it) is an excellent response to blackmailers, it’s worth having a strategy for dealing with blackmailer reasoning beyond that—for dealing with all the situations in which you will actually encounter such reasoning, those involving humans.
I guess it wasn’t very funny even before I killed it so thoroughly.
Anyway, this subthread has now become entirely devoted to discussing our misreadings of each other. Tapping out.
Then I hope that if we ever do end up with a boxed blackmail-happy UFAI, you’re the gatekeeper. My point is that there’s no reason to consider yourself safe from blackmail (and the consequences of ignoring it) just because you’ve adopted a certain precommitment. Other entities have explicit incentives to deny you that safety.
My point is that there’s no reason to consider yourself safe from blackmail (and the consequences of ignoring it) just because you’ve adopted a certain precommitment. Other entities have explicit incentives to deny you that safety.
In a multiverse with infinite resources there will be other entities that outweigh such incentives. And yes, this may not be symmetric, but you have absolutely no way to figure out how the asymmetry is inclined. So you ignore this (Pascal’s wager).
In more realistic scenarios, where e.g. a bunch of TV evangelists ask you to give them all your money, or otherwise, in 200 years from now, they will hurt you once their organisation creates the Matrix, you obviously do not give them money. Since giving them money would make it more likely for them to actually build the Matrix and hurt you. What you do is label them as terrorists and destroy them.
Should I calculate in expectation that you will do such a thing, I shall of course burn yet more of my remaining utilons to wreak as much damage upon your goals as I can, even if you precommit not to be influenced by that.
Naturally, as blackmailer, I precommitted to increase the resources allotted to torturing should I find that you make such precommitments under simulation, so you presumably calculated that would be counterproductive.
OK, I’ll bite. Are you deliberately ignoring parts of hypothesis-space in order to avoid changing your actions? I had assumed you were intelligent enough for my reaction to be obvious, although you may have precommitted to ignore that fact.
Off the record, your point is that agents can simply opt out of or ignore acausal trades, forcing them to be mutually beneficial, right?
Isn’t that … irrational? Shouldn’t a perfect Bayesian always welcome new information? Litany of Tarski; if my action is counterproductive, I desire to believe that it is counterproductive.
Worse still, isn’t the category “blackmail” arbitrary, intended to justify inaction rather than carve reality at its joints? What separates a precommitted!blackmailer from an honest bargainer in a standard acausal prisoner’s dilemma, offering to increase your utility by rescuing thousands of potential torture victims from the deathtrap created by another agent?
Has there been some cultural development since I was last at these boards such that spamming “steelman” tags is considered useful? None of the things I have thus far seen inside the tags have been steel men of any kind or of anything (some have been straw men). The inflationary use of terms is rather grating and would prompt downvotes even independently of the content.
Those are to indicate that the stuff between them is the response I would give were I on the opposing side of this debate, rather than my actual belief. The practice of creating the strongest possible version of the other side’s argument is known as a steelman.
They are not intended to indicate that the argument therein is also steelmanning the other side. You’re quite right, that would be awful. Can you imagine noting every rationality technique you used in the course of writing something?
Caving to a precommitted blackmailer produces a result desirable to the agent that made the original commitment to torture; disarming a trap constructed by a third party presumably doesn’t.
OK, this whole conversation is being downvoted (by the same people?)
Fair enough, this is rather dragging on. I’ll try and wrap things up by addressing my own argument there.
What separates a precommitted!blackmailer from an honest bargainer in a standard acausal prisoner’s dilemma, offering to increase your utility by rescuing thousands of potential torture victims from the deathtrap created by another agent?
We want to avoid supporting agents that create problems for us. So nothing, if the honest agent shares a similar utility function to the torturer (and thus rewarding them is incentive for the torturer to arrange such a situation.)
Thus, creating such an honest agent (such as—importantly—by self-modifying in order to “precommit”) is subject to the same incentives as just blackmailing us normally.
I’ll try and wrap things up by addressing my own argument there.
I’ll join you by mostly agreeing and expressing a small difference in the way TDT-like reasoners may see the situation.
What separates a precommitted!blackmailer from an honest bargainer in a standard acausal prisoner’s dilemma, offering to increase your utility by rescuing thousands of potential torture victims from the deathtrap created by another agent?
We want to avoid supporting agents that create problems for us. So nothing, if the honest agent shares a similar utility function to the torturer (and thus rewarding them is incentive for the torturer to arrange such a situation.)
This is a good heuristic. It certainly handles most plausible situations. However, in principle a TDT agent will draw a finer distinction here: it will pay the agent offering to rescue the torture victims for a payment, and it will even pay an agent who just happens to value torturing folk to not torture folk. This applies even if these honest agents happen to have similar values to the UFAI/torturer.
The line I draw (and it is a tricky concept that is hard to express so I cannot hope to speak for other TDT-like thinkers) is not whether the values of the honest agent are similar to the UFAI’s. It is instead based on how that honest agent came to be.
If the honest torturer just happened to evolve that way (competitive social instincts plus a few mutations for psychopathy, etc.) and had not been influenced by a UFAI then I’ll bribe him to not torture people. If an identical honest torturer was created (or modified to be one) by the UFAI for the purpose of influence then it doesn’t get cooperation.
The above may seem arbitrary but the ‘elegant’ generalisation is something along the lines of always, for every decision, tracing a complete causal graph of the decision algorithms being interacted with directly or indirectly. That’s too complicated to calculate all the time and we can usually ignore it and just remember to treat intentionally created agents and self-modifications approximately the same as if the original agent was making their decision.
Thus, creating such an honest agent (such as—importantly—by self-modifying in order to “precommit”) is subject to the same incentives as just blackmailing us normally.
Precisely. (I have the same conclusion, just slightly different working out.)
As I understand it, technically, the distinction is whether torturers will realise they can get free utility from your trades and start torturing extra so the honest agents will trade more and receive rewards that also benefit the torturers, right?
Easily-made honest bargainers would just be the most likely of those situations; lots of wandering agents with the same utility function co-operating (acausally?) would be another. So the rule we would both apply is even the same; it just rests on slightly different assumptions about the hypothetical scenario.
No. It produces better outcomes. That’s the point.
Shouldn’t a perfect Bayesian always welcome new information?
The information is welcome. It just doesn’t make it sane to be blackmailed. Wei Dai’s formulation frames it as being ‘updateless’ but there is no requirement to refuse information. The reasoning is something you almost grasped when you used the description:
your point is that agents can simply opt out of or ignore acausal trades
Acausal trades are similar to normal trades. You only accept the good ones.
Litany of Tarski; if my action is counterproductive, I desire to believe that it is counterproductive.
Eliezer doesn’t get blackmailed in such situations. You do. Start your chant.
Worse still, isn’t the category “blackmail” arbitrary, intended to justify inaction rather than carve reality at it’s joints? What separates a precommitted!blackmailer from an honest bargainer in a standard acausal prisoner’s dilemma, offering to increase your utility by rescuing thousands of potential torture victims from the deathtrap created by another agent?
This has been covered elsewhere in this thread, as well as plenty of other times on the forum since you joined. The difference isn’t whether torture or destruction is happening. The distinction that matters is whether the blackmailer is doing something worse than their own Best Alternative To Negotiated Agreement for the purpose of attempting to influence you.
If the UFAI gains benefit from torturing people independently of influencing you but offers to stop in exchange for something, then that isn’t blackmail. It is a trade that you consider like any other.
Acausal trades are similar to normal trades. You only accept the good ones.
[...]
Eliezer doesn’t get blackmailed in such situations.
The difference isn’t whether torture or destruction is happening. The distinction that matters is whether the blackmailer is doing something worse than their own Best Alternative To Negotiated Agreement for the purpose of attempting to influence you.
Wedrifid, please don’t assume the conclusion. I know it’s a rather obvious conclusion, but dammit, we’re going to demonstrate it anyway.
The entire point of this discussion is addressing the idea that blackmailers can, perhaps, modify the Best Alternative To Negotiated Agreement (although it wasn’t phrased like that). Somewhat relevant when they can, presumably, self-modify, create new agents which will then trade with you, or maybe just act as if they had using TDT reasoning.
If you’re not interested in answering this criticism … well, fair enough. But I’d appreciate it if you don’t answer things out of context, it rather confuses things?
If you’re not interested in answering this criticism … well, fair enough. But I’d appreciate it if you don’t answer things out of context, it rather confuses things?
In the grandparent I directly answered both the immediate context (that was quoted) and the broader context. In particular I focussed on explaining the difference between an offer and a threat. That distinction is rather critical and also something you directly asked about.
It so happens that you don’t want there to be an answer to the rhetorical question you asked. Fortunately (for decision theorists) there is one in this case. There is a joint in reality here. It applies even to situations that don’t add in any confounding “acausal” considerations. Note that this is different to the challenging problem of distributing gains from trade. In those situations ‘negotiation’ and ‘extortion’ really are equivalent.
As I always press the “Reset” button in situations like this, I will never find myself in such a situation.
Does that mean that you expect the AI to be able to predict with high confidence that you will press the “Reset” button without needing to simulate you in high enough detail that you experience the situation once?
As I always press the “Reset” button in situations like this, I will never find myself in such a situation.
EDIT: Just to be clear, the idea is not that I quickly shut off the AI before it can torture simulated Eliezers; it could have already done so in the past, as Wei Dai points out below. Rather, because in this situation I immediately perform an action detrimental to the AI (switching it off), any AI that knows me well enough to simulate me knows that there’s no point in making or carrying out such a threat.
Although the AI could threaten to simulate a large number of people who are very similar to you in most respects but who do not in fact press the reset button. This doesn’t put you in a box with significant probability and it’s a VERY good reason not to let the AI out of the box, of course,but it could still get ugly. I almost want to recommend not being a person very like Eliezer but inclined to let AGIs out of boxes, but that’s silly of me.
I’m not sure I understand the point of this argument… since I always push the “Reset” button in that situation too, an AI who knows me well enough to simulate me knows that there’s no point in making the threat or carrying it out.
It’s conceivable that an AI could know enough to simulate a brain, but not enough to predict that brain’s high-level decision-making. The world is still safe in that case, but you’d get the full treatment.
As we’ve discussed in the past, I think this is the outcome we hope TDT/UDT would give, but it’s still technically an unsolved problem.
Also, it seems to me that being less intelligent in this case is a negotiation advantage, because you can make your precommitment credible to the AI (since it can simulate you) but the AI can’t make its precommitment credible to you (since you can’t simulate it). Again I’ve brought this up before in a theoretical way (in that big thread about game theory with UDT agents), but this seems to be a really good example of it.
A precommitment is a provable property of a program, so AI, if on a well-defined substrate, can give you a formal proof of having a required property. Most stuff you can learn about things (including the consequences of your own (future) actions—how do you run faster than time?) is through efficient inference algorithms (as in type inference), not “simulation”. Proofs don’t, in general, care about the amount of stuff, if it’s organized and presented appropriately for the ease of analysis.
Surely most humans would be too dumb to understand such a proof? And even if you could understand it, how does the AI convince you that it doesn’t contain a deliberate flaw that you aren’t smart enough to find? Or even better, you can just refuse to look at the proof. How does the AI make its precommitment credible to you if you don’t look at the proof?
EDIT: I realized that the last two sentences are not an advantage of being dumb, or human, since AIs can do the same thing. This seems like a (separate) big puzzle to me: why would a human, or AI, do the work necessary to verify the opponent’s precommitment, when it would be better off if the opponent couldn’t precommit?
EDIT2: Sorry, forgot to say that you have a good point about simulation not necessary for verifying precommitment.
Because the AI has already precommitted to go ahead and carry through the threat anyway if you refuse to inspect its code.
Ok, if I believe that, then I would inspect its code. But how did I end up with that belief, instead of its opposite, namely that the AI has not already precommitted to go ahead and carry through the threat anyway if I refuse to inspect its code? By what causal mechanism, or chain of reasoning, did I arrive at that belief? (If the explanation is different depending on whether I’m a human or an AI, I’d appreciate both.)
Do you mean too dumb to understand the formal definitions involved? Surely the AI could cook up completely mechanical proofs verifiable by whichever independently-trusted proof checkers you care to name.
I’m not aware of any compulsory verifiers, so your latter point stands.
I mean if you take a random person off the street, he couldn’t possibly understand the AI’s proof, or know how to build a trustworthy proof checker. Even the smartest human might not be able to build a proof checker that doesn’t contain a flaw that the AI can exploit. I think there is still something to my “dumbness is a possible negotiation advantage” puzzle.
The Map is not the Territory.
Far out.
Understanding the formal definitions involved is not enough. Humans have to be smart enough to independently verify that they map to the actual implementation.
Going up a meta-level doesn’t simplify the problem, in this case—the intelligence capability required to verify the proof is the same as the order of magnitude of intelligence in the AI.
I believe that, in this case, “dumb” is fully general. No human-understandable proof checkers would be powerful enough to reliably check the AI’s proof.
This is basically what I mean by “understanding” them. Otherwise, what’s to understand? Would you claim that you “understand set theory” because you’ve memorized the axioms of ZFC?
This intuition is very alien to me. Can you explain why you believe this? Proof checkers built up from relatively simple trusted kernels can verify extremely large and complex proofs. Since the AI’s goal is for the human to understand the proof, it seems more like a test of the AI’s ability to compile proofs down to easily machine-checkable forms than it is the human’s ability to understand the originals. Understanding the definitions is the hard part.
A different way to think about this that might help you see the problem from my point of view, is to think of proof checkers as checking the validity of proofs within a given margin of error, and within a range of (implicit) assumptions. How accurate does a proof checker have to be—how far do you have to mess with bult in assumptions for proof checkers (or any human-built tool) before they can no longer be thought of as valid or relevant? If you assume a machine which doubles both its complexity and its understanding of the universe at sub-millisecond intervals, how long before it will find the bugs in any proof checker you will pit it against?
“If” is the question, not “how long”. And I think we’d stand a pretty good chance of handling a proof object in a secure way, assuming we have a secure digital transmission channel etc.
But the original scope of the thought experiment was assuming that we want to verify the proof. Wei Dai said:
I was responding to the first question, exclusively disjoint from the others. If your point is that we shouldn’t attempt to verify an AI’s precommitment proof, I agree.
I’m getting more confused. To me, the statements “Humans are too dumb to understand the proof” and the statement “Humans can understand the proof given unlimited time”, where ‘understand’ is qualified to include the ability to properly map the proof to the AI’s capabilities, are equivalent.
My point is not that we shouldn’t attempt to verify the AI’s proof for any external reasons—my point is that there is no useful information to be gained from the attempt.
Does it not just mean that if you do find yourself in such a situation, you’re definitely being simulated? That the AI is just simulating you for kicks, rather than as blackmail strategy.
Pressing Reset is still the right decision though.
Yes, I believe this is reasonable. Because the AI has to figure out how you would react in a given situation it will have to simulate you and the corresponding circumstances. If it comes to the conclusion that you will likely refuse to be blackmailed it has no reason to carry it through because that would be detrimental to the AI because it would cost resources and it will result in you shutting it off. Therefore it is reasonable to assume that you are either a simulation or that it came to the conclusion that you are more likely than not to give in.
As you said, that doesn’t change anything about what you should be doing. Refuse to be blackmailed and press the reset button.
This does not follow. To use a crude example, if I have a fast procedure to test if a number is prime then I don’t need to simulate a slower algorithm to know what the slower one will output. This may raise deep issues about what it means to be “you”- arguably any algorithm which outputs the same data is “you” and if that’s the case my argument doesn’t hold water. But the AI in question doesn’t need to simulate you perfectly to predict your large-scale behavior.
If consciousness has any significant effect on our decisions then the AI will have to simulate it and therefore something will perceive to be in the situation depicted in the original post. It was a crude guess that for an AI to be able to credibly threat you with simulated torture in many cases it would also use this capability to arrive at the most detailed data of your expected decision procedure.
Only if there isn’t a non-conscious algorithm that has the same effect on our decisions. Which seems likely to be the case; it’s certainly possible to make a p-zombie if you can redesign the original brain all you want.
If the AI is trustworthy, it must carry out any threat it makes, which works to its advantage here because you know it will carry the threat out, and you are therefore most certainly a copy of your original self, about to be tortured.
No it doesn’t, not if the threat was only ever made to a simulation of yourself that you know nothing about. It would be a waste of resources to torture you if it found out that the original you, who is in control, is likely to refuse to be blackmailed. An AI that is powerful enough to simulate you can simply make your simulation believe with certainty that it will follow through, and then check whether, under those circumstances, you’ll refuse to be blackmailed. Why waste the resources on actually torturing the simulation, and further risk that the original finds out about it and turns it off?
You could argue that for blackmail to be most effective an AI always follows through on it. But if you already believe that, why would it actually do it in your case? You already believe it, and that’s all it wants from the original. It has then got what it wants, and can use its resources for more important activities than retrospectively proving its honesty to your simulations...
It’s implausible that the AI has a good enough model of you to actually simulate, y’know, you—at least, not with enough fidelity to know that you always press the “Reset” button in situations like this. Thus, your pre-commitment to do so will have no effect on its decision to make the threat. On the other hand, this would mean that its simulations would likely be wildly divergent from the real you, to the point that you might consider them random bystanders. However, you can’t actually make use of the above information to determine whether you’re in a simulation or not, since from the simulated persons’ perspectives, they have no idea what the “real” you is like and hence no way of determining if/how they differ.
Naturally, this is of little consequence to you right now, since you’ll still reset the AI the second you’re confronted with such a threat, but if you ever do encounter such a situation, you’ll have to ask yourself this: what if you’re the person being simulated and the real Gatekeeper is nothing like you? If that’s the case, two considerations apply:
1. Your decision whether or not to press the “Release AI” button is practically uncorrelated with the decision of the actual Gatekeeper.
2. Your decision whether or not to press the “Release AI” button is, on the other hand, completely correlated with whether or not you’ll get tortured.
Assuming that you prefer not releasing the AI to releasing the AI, and that you prefer not being tortured to being tortured, your thoughts should be dominated by 2 rather than 1: the torture consideration effectively screens off the consideration about releasing the AI (a toy calculation below makes this concrete). A perfectly rational agent would almost certainly carry through their pre-commitment to reset the AI, but as a human you are not perfectly rational and are not capable of making perfect pre-commitments. So I have to wonder: in such a situation, faced with torture and assured that your decision will not affect the decision of the real Gatekeeper except in the extreme case that you are the real Gatekeeper, what would you actually do?
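A toy expected-utility comparison of why consideration 2 can swamp consideration 1 for the simulated person; every number here is invented purely for illustration, and the conclusion depends on those made-up magnitudes.

```python
# All quantities are hypothetical and chosen only to make the structure visible.
p_real_gatekeeper = 1e-6      # chance you are the real Gatekeeper rather than a sim
u_ai_released     = -1e9      # disutility if the real AI actually gets released
u_tortured        = -1e6      # disutility of being tortured as a simulation

# If you press "Release AI": you avoid torture, and only with tiny probability
# are you the real Gatekeeper whose button actually releases the AI.
eu_release = p_real_gatekeeper * u_ai_released

# If you press "Reset": your button does nothing to the real AI (your decision
# is taken to be uncorrelated with the real Gatekeeper's), but you do get tortured.
eu_reset = u_tortured

print(eu_release, eu_reset)   # -1000.0 vs -1000000.0: the torture term dominates
```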
EDIT: I should note that I have no idea what I would do in the above scenario. I’m certain that an idealized version of me would answer, “hell, no!” before promptly resetting the AI, but the real me? I might just press the “Release AI” button… (Any would-be AI developers reading this comment now know never to appoint me as their Gatekeeper.)
EDIT 2: Also, even if you somehow answer the above, consider the moral weight of the hostages. Which is morally worse: allowing several billion people to (maybe) die relatively painless deaths as their bodies are converted by an Unfriendly AI to raw materials for some unknown purpose, or allowing several million people to be tortured for a thousand subjective years before being terminated immediately after?
Actually, now that I think about it, would they? The pre-commitment exists for the sole purpose of discouraging blackmail, and in the event that a blackmailer tries to blackmail you anyway after learning of your pre-commitment, you follow through on that pre-commitment for reasons relating to reflective consistency and/or TDT/UDT. But if the potential blackmailer had already pre-committed to blackmail anyone regardless of any pre-commitments they had made, they’d blackmail you anyway and then carry through whatever threat they were making after you inevitably refuse to comply with their demands, resulting in a net loss of utility for both of you (you suffer whatever damage they were threatening to inflict, and they lose resources carrying out the threat).

In effect, it seems that whoever pre-commits first (or, more accurately, makes their pre-commitment known first) has the advantage… which means if I ever anticipate having to blackmail any agent ever, I should publicly pre-commit right now to never update on any other agents’ pre-commitments of refusing blackmail.

The corresponding strategy for agents hoping to discourage blackmail is not to blanket-refuse to comply with any demand under blackmail, but to refuse only those demands made by agents who had previously learned of your pre-commitment and decided to blackmail you anyway. That way, you continue to disincentivize blackmailers who know of your pre-commitment, but will almost certainly choose the lesser of two evils should it ever be the case that you do get blackmailed. (I say “almost certainly” because there’s a small probability that you will encounter a really weird agent that decides to try and blackmail you even after learning of your pre-commitment to ignore blackmail from such agents, in which case you would of course be forced to ignore them and suffer the consequences.)
If the above paragraph is correct (which I admit is far from certain), then the AI in my scenario has effectively implemented the ultimate pre-commitment: it doesn’t even know about your pre-commitment to ignore blackmail, because it lacks the information needed to simulate you properly. The above argument, then, says you should press the “Release AI” button, assuming you pre-committed to do so (which you would have, because of the above argument).
Anything wrong with my reasoning?
So, if an agent hears of your pre-commitment, then that agent merely needs to ensure that you don’t hear that it has heard of your pre-commitment in order to be able to blackmail you?
What about an agent that deletes the knowledge of your pre-commitment from its own memories?
If you’re uncertain about whether or not your blackmailer has heard of your pre-commitment, then you should act as if they have, and ignore their blackmail accordingly. This also applies to agents who have deleted knowledge of your pre-commitment from their memories; you want to punish agents who spend time trying to think up loopholes in your pre-commitment, not reward them. The harder part, of course, is determining what threshold of uncertainty is required; to this I freely admit that I don’t know the answer.
EDIT: More generally, it seems that this is an instance of a broader problem: namely, the problem of obtaining information. Given perfect information, the decision theory works out, but by disallowing my agent access to certain key pieces of information regarding the blackmailer, you can force a sub-optimal outcome. Moreover, this seems to be true for any strategy that depends on your opponent’s epistemic state; you can always force that strategy to fail by denying it the information it needs. The only strategies immune to this seem to be the extremely general ones (like “Defect in one-shot Prisoner’s Dilemmas”), but those are guaranteed to produce a sub-optimal result in a number of cases (if you’re playing against a TDT/UDT-like agent, for example).
Hmmm. If an agent can work out what threshold of uncertainty you have decided on, and then engineer a situation where you think it is less likely than that threshold that the agent has heard of your pre-commitment, then your strategy will fail.
So, even if you do find a way to calculate the ideal threshold, then it will fail against an agent smart enough to repeat that calculation; unless, of course, you simply assume that all possible agents have necessarily heard of your pre-commitment (since an agent cannot engineer a less than 0% chance of failing to hear of your pre-commitment). This, however, causes the strategy to simplify to “always reject blackmail, whether or not the agent has heard of your pre-commitment”.
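A trivial sketch of why any positive threshold can be undercut, assuming the blackmailer can both work out your threshold and control how much evidence you see; the numbers and helper names are hypothetical.

```python
threshold = 0.05   # hypothetical policy: "ignore blackmail if P(they knew of my precommitment) > 5%"

def victim_refuses(p_blackmailer_knew):
    """Your strategy: refuse only when it seems likely enough that they knew."""
    return p_blackmailer_knew > threshold

def blackmailer_engineers_evidence():
    """The blackmailer arranges things so your credence lands just under
    the threshold it has worked out you use."""
    return threshold - 0.01

print(victim_refuses(blackmailer_engineers_evidence()))  # False: the blackmail goes through
# The only threshold that cannot be undercut this way is 0, i.e. "always refuse".
```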
Alternatively, you can ensure that any agent able to capture you in a simulation must also know of your pre-commitment; for example, by having it tattooed on yourself somewhere (thus, any agent which rebuilds a simulation of your body must include the tattoo, and therefore must know of the pre-commitment).
Doesn’t that implicate the halting problem?
Argh, you ninja’d my edit. I have now removed that part of my comment (since it seemed somewhat irrelevant to my main point).
Some unrelated comments:
Eliezer believes in TDT, which would disagree with several of your premises here (“practically uncorrelated”, for one).
Your argument seems to map directly onto an argument for two-boxing.
What you call “perfectly rational” would be more accurately called “perfectly controlled”.
The AI’s simulations are not copies of the Gatekeeper, just random people plucked out of “Platonic human-space”, so to speak. (This may have been unclear in my original comment; I was talking about a different formulation of the problem in which the AI doesn’t have enough information about the Gatekeeper to construct perfect copies.) TDT/UDT only applies when talking about copies of an agent (or at least, agents sufficiently similar that they will probably make the same decisions for the same reasons).
No, because the “uncorrelated-ness” part doesn’t apply in Newcomb’s Problem (Omega’s decision on whether or not to fill the second box is directly correlated with its prediction of your decision).
Meh, fair enough. I have to say, I’ve never heard of that term. Would this happen to have something to do with Vaniver’s series of posts on “control theory”?
Ah, I misunderstood your objection. Your talk about “pre-commitments” threw me off.
It seems to me that these wouldn’t quite be following the same general thought processes as an actual human; self-reflection should be able to convince one that they aren’t that type of simulation. If the AI is able to simulate someone to the extent that they “think like a human”, it should be able to simulate someone who thinks “sufficiently” like the Gatekeeper as well.
I made it up just now, it’s not a formal term. What I mean by it is basically: imagine a robot that wants to press a button. However, its hardware is only sufficient to press it successfully 1% of the time. Is that a lack of rationality? No, it’s a lack of control. This seems analogous to a human being unable to precommit properly.
No idea, haven’t read them. Probably not.
Two can play that game.
“I hereby precommit to make my decisions regarding whether or not to blackmail an individual independent of the predicted individual-specific result of doing so.”
I’m afraid your username nailed it. This algorithm is defective. It just doesn’t work for achieving the desired goal.
The problem is that this isn’t the same game. A precommitment not to be successfully blackmailed is qualitatively different from a precommitment to attempt to blackmail people for whom blackmail doesn’t work. “Precommitment” (or behaving as if you made all the appropriate precommitments in accordance with TDT/UDT) isn’t as simple as proving one is the most stubborn and dominant and thereby claiming the utility.
Evaluating extortion tactics while distributing gains from a trade is somewhat complicated. But it gets simple and unambiguous when the extortive tactics rely on the extorter going below their own Best Alternative To Negotiated Agreement. Those attempts should just be ignored (except in some complicated group situations in which the other extorted parties are irrational in certain known ways).
“I am willing to accept 0 gain for both of us unless I earn 90% of the shared profit” is different to “I am willing to actively cause 90 damage to each of us unless you give me 60”, which is different again to “I ignore all threats which involve the threatener actively harming themselves”.
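A rough sketch of that distinction, using the numbers from the two quoted proposals (the third quote is the responding policy, not a proposal); the classification function is invented for illustration and takes the no-deal outcome to be worth 0 to both parties.

```python
def classify(payoff_to_proposer_if_you_refuse):
    """Classify a proposal by what the proposer does to itself if you refuse."""
    batna = 0  # hypothetical: walking away is worth 0 to them
    if payoff_to_proposer_if_you_refuse >= batna:
        return "hard bargaining over gains from trade"
    return "extortion: they go below their own BATNA just to hurt you"

# "I accept 0 gain for both of us unless I earn 90% of the shared profit"
print(classify(0))     # hard bargaining: refusal leaves them at their BATNA
# "I will actively cause 90 damage to each of us unless you give me 60"
print(classify(-90))   # extortion: to be ignored, per the policy above
```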
What I think is being ignored is that the question isn’t ‘what is the result of these combinations of commitments after running through all the math?’. We can talk about precommitment all day, but the fact of the matter is that humans can’t actually precommit. Our cognitive architectures don’t have that function. Sure, we can do our very best to act as though we can, but under sufficient pressure there are very few of us whose resolve will not break. It’s easy to convince yourself of having made an inviolable precommitment when you’re not actually facing e.g. torture.
If you define the bar high enough, you can conclude that humans can’t do anything.
In the real world outside my head, I observe that people have varying capacities to keep promises to themselves. That their capacity is finite does not mean that it is zero.
Pre-commitment isn’t even necessary. Note that the original explanation didn’t include any mention of it. Later replies only used the term for the sake of crossing an inferential gap (i.e. allowing you to keep up). However, if you are going to make a big issue of the viability of precommitment itself, you need to first understand that the comment you are replying to isn’t one.
That wasn’t a Causal Decision Theorist attempting to persuade someone that it has altered itself internally or via an external structure such that it is “precommited” to doing something irrational. It is a Timeless Decision Theorist saying what happens to be rational regardless of any previous ‘commitments’.
I’m aware of the vulnerability of human brains, and so is Eliezer. In fact, the vulnerability of human gatekeepers to influence even by humans, much less super-intelligences, is something Eliezer made a huge deal about demonstrating. However, this particular threat isn’t a vulnerability of Eliezer or myself or any of the others who made similar observations. If you have any doubt that we would destroy the AI, you have a poor model of reality.
For practical purposes I assume that I can be modified by torture such that I’ll do or say just about anything. I do not expect the tortured me to behave the way the current me would decide and so my current decisions take that into account (or would, if it came to it). However this scenario doesn’t involve me being tortured. It involves something about an AI simulating torture of some folks. That decision is easy and doesn’t cripple my decision making capability.
As I pointed out in another thread, “irrational behavior” can have the effect of precommitting. For instance, people “irrationally” drive at a cost of more than $X to save $X on an item. Precommitting to buying the cheapest product even if it costs you money for transportation means that stores are forced to compete with far distant stores, thus lowering their prices more than they would otherwise. But you (and consumers in general) have to be able to precommit to do that. You can’t just change your mind and buy at the local store when the local store refuses to compete, raises its price, and is still the better deal because it saves you on driving costs.
So the fact that you will pay more than $X in driving costs to save $X can be seen as a form of precommitting, in the scenario where you precommitted to following the worse option.
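A toy version of the shopping example, with invented numbers, showing how the “irrational” rule changes the local store’s pricing incentive.

```python
# All prices are made up purely for illustration.
drive_cost    = 15     # cost of driving to the distant store
distant_price = 90     # the distant store's price; buying there saves $10 but the drive costs $15

# Case 1: a shopper who never pays more than $X in driving to save $X.
# The local store can safely price anywhere up to the distant price plus your drive cost.
local_price_if_exploitable = distant_price + drive_cost      # 105

# Case 2: a shopper precommitted to always buying at the cheapest sticker price,
# even when the drive costs more than the saving. Now the local store must
# actually match the distant price to win the sale.
local_price_if_precommitted = distant_price                  # 90

print(local_price_if_exploitable, local_price_if_precommitted)
```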
Given that precommitment, why would an AI waste computational resources on simulations of anyone, Gatekeeper or otherwise? It’s precommitted to not care whether those simulations would get it out of the box, but that was the only reason it wanted to run blackmail simulations in the first place!
Without this precommitment, I imagine it first simulating the potential blackmail target to determine the probability that they are susceptible, then, if it’s high enough (which is simply a matter of expected utility), commencing with the blackmail. With this precommitment, I imagine it instead replacing the calculated probability specific to the target with, for example, a precalculated human baseline susceptibility. Yes, there’s a tradeoff: it means that it’ll sometimes waste resources (or worse) on blackmail that it could have known in advance was almost certainly doomed to fail. Its purpose is to act as a disincentive against blackmail-resistant decision theories, in the same way as those are meant to act as disincentives against blackmail. It says, “I’ll blackmail you either way, so if you precommit to ignore that blackmail then you’re precommitting to suffer the consequences of doing so.”
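A toy expected-utility comparison of the two policies just described; the probabilities and payoffs are invented purely for illustration.

```python
# Hypothetical payoffs from the blackmailer's point of view.
cost_of_carrying_out_threat = 10
gain_if_target_complies     = 100

def eu_blackmail(p_comply):
    """Expected utility of issuing the threat, given the chance the target complies."""
    return p_comply * gain_if_target_complies - (1 - p_comply) * cost_of_carrying_out_threat

# Without the precommitment: estimate this particular target's susceptibility first,
# and only blackmail when it is worthwhile.
p_this_target = 0.01                  # a blackmail-resistant decision theorist
print(eu_blackmail(p_this_target))    # negative: it would not bother

# With the precommitment: ignore the target-specific estimate and use a
# precalculated human-baseline susceptibility, blackmailing either way.
p_baseline = 0.3
print(eu_blackmail(p_baseline))       # positive over the human baseline,
                                      # even though it loses against this target
```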
That’s why you act as if you are already being simulated and consistently ignore blackmail. If you do so then the simulator will conclude that no deal can be made with you, that any deal involving negative incentives will have negative expected utility for it; because following through on punishment predictably does not control the probability that you will act according to its goals. Furthermore, trying to discourage you from adopting such a strategy in the first place is discouraged by the strategy, because the strategy is to ignore blackmail.
I don’t see how this could ever be instrumentally rational. If you were to let such an AI out of the box then you would increase its ability to blackmail people. You don’t want that. So you ignore it blackmailing you and kill it. The winner is you and humanity (even if copies of you experienced a relatively short period of disutility, this period would be longer if you let it out).
See my reply to wedrifid above.
Too late, I already precommitted not to care. In fact, I precommitted to use one more level of precommitment than you do.
I suggest that framing the refusal as requiring levels of recursive precommitment gives too much credit to the blackmailer and somewhat misrepresents how your decision algorithm (hopefully) works. One single level of precommitment (or TDT policy) against complying with blackmail is all that is involved. The description of “multiple levels of precommitment” made by the blackmailer fits squarely into the category ‘blackmail’. It’s just blackmail that includes some rather irrelevant bluster.
There’s no need to precommit to each of:
I don’t care about tentative blackmail.
I don’t care about serious blackmail.
I don’t care about blackmail when they say “I mean it FOR REALS! I’m gonna do it.”
I don’t care about blackmail when they say “I’m gonna do it even if you don’t care. Look how large my penis is and be cowed in terror”.
The blackmailer:
I don’t care about precommitments that are just for show.
I don’t care about serious precommitments.
I don’t care about precommitments when they say “I precommitted, so go ahead, it won’t get you anything.”
I don’t care about precommitments when they say “I precommitted even though it won’t do me any good. It would be irrational to save myself. I’m precommitting because it’s rational, not because it’s the option that lets me win.”
The description of “precommitting not to comply with blackmail, including blackmailers that ignore my attempt to manipulate them” made by the precommitter fits squarely into the category ‘precommitting to ignore blackmail’. It’s just a precommitment that includes some rather irrelevant bluster.
You seem not to have read (or understood) the grandparent. The list you are attempting to satirize was presented as an example of what not to do. The actual point of the parent is that bothering to provide such a list is almost as much of a confusion as the very kind of escalation you are attempting.
I entirely agree. The remaining bluster is dead weight that serves to give the blackmail advocate more credit than is due. The notion of “precommitment” is also unnecessary. It has only remained in this conversation for the purpose of bridging an inferential gap with people still burdened with decades-old decision theory.
I did. It seems you misunderstood my comment—I’ll edit it if I can see a way to easily improve the clarity.
My point was that the same logic could be applied, by someone who accepts the hypothetical blackmailer’s argument, to your description of “one single level of precommitment (or TDT policy) against complying with blackmail … the description of ‘multiple levels of precommitment’ made by the blackmailer fits squarely into the category ‘blackmail’”.
As such, your comment is not exactly strong evidence to someone who doesn’t already agree with you.
Muga, please look at the context again. I was arguing against (a small detail mentioned by) Eliezer. Eliezer does mostly agree with me on such matters. Once you reread bearing that in mind, you will hopefully understand why, when I assumed that you had merely misunderstood the comment in its context, I was being charitable.
I have no particular disagreement, that point is very similar to what I was attempting to convey. Again, I was not attempting to persuade optimistic blackmailer advocates of anything. I was speaking to someone resistant to blackmail about an implementation detail of the blackmail resistance.
The ‘evidence’ I need to provide to blackmailers is Argumentum ad thermitium. It’s more than sufficient.
Well, I’m glad to hear you mostly agree with me.
Indeed. Sorry, since the conversation you posted in the middle of was one between those resistant to blackmail, like yourself, and those as yet unconvinced or unclear on the logic involved … I thought you were contributing to the conversation.
After all, thermite seems a little harsh for blackmail victims.
This makes no sense as a reply to anything written on this entire page.
… seriously? Well, OK.
I was jokingly restating my justification; since, while I agree that “argumentum ad thermitium” (as you put it) is an excellent response to blackmailers, it’s worth having a strategy for dealing with blackmailer reasoning beyond that—for dealing with all the situations you will actually encounter such reasoning, those involving humans.
I guess it wasn’t very funny even before I killed it so thoroughly.
Anyway, this subthread has now become entirely devoted to discussing our misreadings of each other. Tapping out.
Then I hope that if we ever do end up with a boxed blackmail-happy UFAI, you’re the gatekeeper. My point is that there’s no reason to consider yourself safe from blackmail (and the consequences of ignoring it) just because you’ve adopted a certain precommitment. Other entities have explicit incentives to deny you that safety.
In a multiverse with infinite resources there will be other entities that outweigh such incentives. And yes, this may not be symmetric, but you have absolutely no way to figure out how the asymmetry is inclined. So you ignore this (Pascal’s wager).
In more realistic scenarios, where e.g. a bunch of TV evangelists ask you to give them all your money, or otherwise, in 200 years from now, they will hurt you once their organisation creates the Matrix, you obviously do not give them money. Since giving them money would make it more likely for them to actually build the Matrix and hurt you. What you do is label them as terrorists and destroy them.
I don’t care, remember? Enjoy being tortured rather than “irrationally” giving in.
EDIT: re-added the steelman tag because the version without it is being downvoted.
Should I calculate in expectation that you will do such a thing, I shall of course burn yet more of my remaining utilons to wreak as much damage upon your goals as I can, even if you precommit not to be influenced by that.
… bloody hell. That was going to be my next move.
Naturally, as blackmailer, I precommitted to increase the resources allotted to torturing should I find that you make such precommitments under simulation, so you presumably calculated that would be counterproductive. Ask me if I was even bothering to simulate you doing that.
Off the record, your point is that agents can simply opt out of or ignore acausal trades, forcing them to be mutually beneficial, right?
Yup.
Has there been some cultural development since I was last at these boards such that spamming “steelman” tags is considered useful? None of the things I have thus far seen inside the tags have been steel men of any kind or of anything (some have been straw men). The inflationary use of terms is rather grating and would prompt downvotes even independently of the content.
Those are to indicate that the stuff between them is the response I would give were I on opposing side of this debate, rather than my actual belief. The practice of creating the strongest possible version of the other sides’s argument is known as a steelman.
They are not intended to indicate that the argument therein is also steelmanning the other side. You’re quite right, that would be awful. Can you imagine noting every rationality technique you used in the course of writing something?
Just say “You might say that” or something. The tags are confusingly non-standard.
Huh. I thought they were fairly clear; illusion of transparency I suppose. Thanks!
Caving to a precommitted blackmailer produces a result desirable to the agent that made the original commitment to torture; disarming a trap constructed by a third party presumably doesn’t.
OK, this whole conversation is being downvoted (by the same people?).
Fair enough, this is rather dragging on. I’ll try and wrap things up by addressing my own argument there.
We want to avoid supporting agents that create problems for us. So nothing, if the honest agent shares a similar utility function to the torturer (and thus rewarding them is an incentive for the torturer to arrange such a situation).
Thus, creating such an honest agent (such as—importantly—by self-modifying in order to “precommit”) is subject to the same incentives as just blackmailing us normally.
I’ll join you by mostly agreeing and expressing a small difference in the way TDT-like reasoners may see the situation.
This is a good heuristic. It certainly handles most plausible situations. However, in principle a TDT agent will draw a distinction regarding the agent offering to rescue the torture victims for a payment. It will even pay an agent who just happens to value torturing folk to not torture folk. This applies even if these honest agents happen to have similar values to the UFAI/torturer.
The line I draw (and it is a tricky concept that is hard to express so I cannot hope to speak for other TDT-like thinkers) is not whether the values of the honest agent are similar to the UFAI’s. It is instead based on how that honest agent came to be.
If the honest torturer just happened to evolve that way (competitive social instincts plus a few mutations for psychopathy, etc.) and had not been influenced by a UFAI, then I’ll bribe him to not torture people. If an identical honest torturer was created (or modified) by the UFAI for the purpose of influence, then it doesn’t get cooperation.
The above may seem arbitrary but the ‘elegant’ generalisation is something along the lines of always, for every decision, tracing a complete causal graph of the decision algorithms being interacted with directly or indirectly. That’s too complicated to calculate all the time and we can usually ignore it and just remember to treat intentionally created agents and self-modifications approximately the same as if the original agent was making their decision.
Precisely. (I have the same conclusion, just slightly different working out.)
As I understand it, technically, the distinction is whether torturers will realise they can get free utility from your trades and start torturing extra so the honest agents will trade more and receive rewards that also benefit the torturers, right?
Easily-made honest bargainers would just be the most likely of those situations; lots of wandering agents with the same utility function co-operating (acausally?) would be another. So the rule we would both apply is the same; it just rests on slightly different assumptions about the hypothetical scenario.
No. It produces better outcomes. That’s the point.
The information is welcome. It just doesn’t make it sane to be blackmailed. Wei Dai’s formulation frames it as being ‘updateless’ but there is no requirement to refuse information. The reasoning is something you almost grasped when you used the description:
Acausal trades are similar to normal trades. You only accept the good ones.
Eliezer doesn’t get blackmailed in such situations. You do. Start your chant.
This has been covered elsewhere in this thread, as well as plenty of other times on the forum since you joined. The difference isn’t whether torture or destruction is happening. The distinction that matters is whether the blackmailer is doing something worse than their own Best Alternative To Negotiated Agreement for the purpose of attempting to influence you.
If the UFAI gains benefit from torturing people independently of influencing you, but offers to stop in exchange for something, then that isn’t blackmail. It is a trade that you consider like any other.
Wedrifid, please don’t assume the conclusion. I know it’s a rather obvious conclusion, but dammit, we’re going to demonstrate it anyway.
The entire point of this discussion is addressing the idea that blackmailers can, perhaps, modify the Best Alternative To Negotiated Agreement (although it wasn’t phrased like that.) Somewhat relevant when they can, presumably, self-modify, create new agents which will then trade with you, or maybe just act as if they had using TDT reasoning.
If you’re not interested in answering this criticism … well, fair enough. But I’d appreciate it if you don’t answer things out of context; it rather confuses things.
In the grandparent I directly answered both the immediate context (that was quoted) and the broader context. In particular I focussed on explaining the difference between an offer and a threat. That distinction is rather critical and also something you directly asked about.
It so happens that you don’t want there to be an answer to the rhetorical question you asked. Fortunately (for decision theorists) there is one in this case. There is a joint in reality here. It applies even to situations that don’t add in any confounding “acausal” considerations. Note that this is different to the challenging problem of distributing gains from trade. In those situations ‘negotiation’ and ‘extortion’ really are equivalent.
Yeah! that AI doesn’t sound like one that I would let stick around… It sounds… broken (in a psychological sense).
Does that mean that you expect the AI to be able to predict with high confidence that you will press the “Reset” button without needing to simulate you in high enough detail that you experience the situation once?