I think saying “Roko’s arguments [...] weren’t generally accepted by other Less Wrong users” is not giving the whole story. Yes, it is true that essentially nobody accepts Roko’s arguments exactly as presented. But a lot of LW users at least thought something along these lines was plausible. Eliezer thought it was so plausible that he banned discussion of it (instead of saying “obviously, information hazards cannot exist in real life, so there is no danger discussing them”).
In other words, while it is true that LWers didn’t believe Roko’s basilisk, they thought it was plausible instead of ridiculous. When people mock LW or Eliezer for believing in Roko’s Basilisk, they are mistaken, but not completely mistaken—if they simply switched to mocking LW for believing the basilisk is plausible, they would be correct (though the mocking would still be mean, of course).
Eliezer thought it was so plausible that he banned discussion of it
If you are a programmer and think your code is safe because you see no way things could go wrong, it’s still not good to believe that it isn’t plausible that there’s a security hole in your code.
Rather, you practice defense in depth and plan for the possibility that things can go wrong somewhere in your code, so you add safety precautions. Even when there isn’t what courts would call reasonable doubt, a good safety engineer still adds additional precautions to security-critical code.
Eliezer deals with FAI safety. As a result, it’s good for him to have a mindset of really caring about safety.
German nuclear power stations run trainings for their desk workers to teach them not to cut themselves on paper. That alone seems strange to outsiders, but everyone in Germany thinks it’s very important for nuclear power stations to foster a culture of safety, even if that means occasionally going overboard.
Cf. AI Risk and the Security Mindset.
If you are a programmer and think your code is safe because you see no way things could go wrong, it’s still not good to believe that it isn’t plausible that there’s a security hole in your code.
Let’s go with this analogy. The good thing to do is ask a variety of experts for safety evaluations, run the code through a wide variety of tests, etc. The thing NOT to do is keep the code a secret while looking for mistakes all by yourself. If you keep your code out of public view, it is more likely to have security issues, since it was not scrutinized by the public. Banning discussion is almost never correct, and it’s certainly not a good habit.
Let’s go with this analogy. The good thing to do is ask a variety of experts for safety evaluations, run the code through a wide variety of tests, etc. The thing NOT to do is keep the code a secret while looking for mistakes all by yourself.
No. If you don’t want to use some code, you don’t give it to a variety of experts for safety evaluations; you simply don’t run it.
Having a public discussion is like running the code untested on a mission critical system.
What utility do you think is gained by discussing the basilisk?
and it’s certainly not a good habit.
Strawman. This forum is not a place where things get habitually banned.
What utility do you think is gained by discussing the basilisk?
An interesting discussion that leads to better understanding of decision theories? Like, the same utility as is gained by any other discussion on LW, pretty much.
Strawman. This forum is not a place where things get habitually banned.
Sure, but you’re the one that was going on about the importance of the mindset and culture; since you brought it up in the context of banning discussion, it sounded like you were saying that such censorship was part of a mindset/culture that you approve of.
Like, the same utility as is gained by any other discussion on LW, pretty much.
Not every discussion on LW has the same utility.
You engage in a pattern of simplifying the subject and then complaining that your flawed understanding doesn’t make sense.
Sure, but you’re the one that was going on about the importance of the mindset and culture
LW doesn’t have a culture of habitually banning discussion. Claiming that it does is wrong.
I’m claiming that particular actions of Eliezer come out of being concerned about safety. I don’t claim that Eliezer engages in habitual banning on LW because of those concerns.
It’s a complete strawman that you are making up.
Just FYI, if you want a productive discussion you should hold back on accusing your opponents of fallacies. Ironically, since I never claimed that you claimed Eliezer engages in habitual banning on LW, your accusation that I made a strawman argument is itself a strawman argument.
Anyway, we’re not getting anywhere, so let’s disengage.
The wiki article talks more about this; I don’t think I can give the whole story in a short, accessible way.
It’s true that LessWrongers endorse ideas like AI catastrophe, Hofstadter’s superrationality, one-boxing in Newcomb’s problem, and various ideas in the neighborhood of utilitarianism; and those ideas are weird and controversial; and some criticisms of Roko’s basilisk are proxies for a criticism of one of those views. But in most cases it’s a proxy for a criticism like ‘LW users are panicky about weird obscure ideas in decision theory’ (as in Auerbach’s piece), ‘LWers buy into Pascal’s Wager’, or ‘LWers use Roko’s Basilisk to scare up donations/support’.
So, yes, I think people’s real criticisms aren’t the same as their surface criticisms; but the real criticisms are at least as bad as the surface criticism, even from the perspective of someone who thinks LW users are wrong about AI, decision theory, meta-ethics, etc. For example, someone who thinks LWers are overly panicky about AI and overly fixated on decision theory should still reject Auerbach’s assumption that LWers are irrationally panicky about Newcomb’s Problem or acausal blackmail; the one doesn’t follow from the other.
I’m not sure what your point is here. Would you mind re-phrasing? (I’m pretty sure I understand the history of Roko’s Basilisk, so your explanation can start with that assumption.)
For example, someone who thinks LWers are overly panicky about AI and overly fixated on decision theory should still reject Auerbach’s assumption that LWers are irrationally panicky about Newcomb’s Problem or acausal blackmail; the one doesn’t follow from the other.
My point was that LWers are irrationally panicky about acausal blackmail: they think Basilisks are plausible enough that they ban all discussion of them!
(Not all LWers, of course.)
If you’re saying ‘LessWrongers think there’s a serious risk they’ll be acausally blackmailed by a rogue AI’, then that seems to be false. That even seems to be false in Eliezer’s case, and Eliezer definitely isn’t ‘LessWrong’. If you’re saying ‘LessWrongers think acausal trade in general is possible,’ then that seems true but I don’t see why that’s ridiculous.
Is there something about acausal trade in general that you’re objecting to, beyond the specific problems with Roko’s argument?
It seems we disagree on this factual issue. Eliezer does think there is a risk of acausal blackmail, or else he wouldn’t have banned discussion of it.
Sorry, I’ll be more concrete; “there’s a serious risk” is really vague wording. What would surprise me greatly is if I heard that Eliezer assigned even a 5% probability to there being a realistic quick fix to Roko’s argument that makes it work on humans. I think a larger reason for the ban was just that Eliezer was angry with Roko for trying to spread what Roko thought was an information hazard, and angry people lash out (even when it doesn’t make a ton of strategic sense).
Probably not a quick fix, but I would definitely say Eliezer gives significant chances (say, 10%) to there being some viable version of the Basilisk, which is why he actively avoids thinking about it.
If Eliezer was just angry at Roko, he would have yelled at or banned Roko; instead, he banned all discussion of the subject. That doesn’t even make sense as a “lashing out” reaction against Roko.
It sounds like you have a different model of Eliezer (and of how well-targeted ‘lashing out’ usually is) than I do. But, like I said to V_V above:
According to Eliezer, he had three separate reasons for the original ban: (1) he didn’t want any additional people (beyond the one Roko cited) to obsess over the idea and get nightmares; (2) he was worried there might be some variant on Roko’s argument that worked, and he wanted more formal assurances that this wasn’t the case; and (3) he was just outraged at Roko. (Including outraged at him for doing something Roko thought would put people at risk of torture.)
The point I was making wasn’t that (2) had zero influence. It was that (2) probably had less influence than (3), and its influence was probably of the ‘small probability of large costs’ variety.
I don’t know enough about this to tell if (2) had more influence than (3) initially. I’m glad you agree that (2) had some influence, at least. That was the main part of my point.
How long did discussion of the Basilisk stay banned? Wasn’t it many years? How do you explain that, unless the influence of (2) was significant?
I believe he thinks that sufficiently clever idiots competing to shoot off their own feet will find some way to do so.
It seems unlikely that they would, if their gun is some philosophical decision theory stuff about blackmail from their future. I don’t expect that gun to ever fire, no matter how many times you click the trigger.
That is not what I said, and I’m also guessing you did not have a grandfather who taught you gun safety.
If you’re saying ‘LessWrongers think there’s a serious risk they’ll be acausally blackmailed by a rogue AI’, then that seems to be false. That even seems to be false in Eliezer’s case,
Is it?
Assume that:
a) There will be a future AI powerful enough to torture people, even posthumously (I think this is quite speculative, but let’s assume it for the sake of the argument).
b) This AI will have a value system based on some form of utilitarian ethics.
c) This AI will use an “acausal” decision theory (one that one-boxes in Newcomb’s problem).
Under these premises it seems to me that Roko’s argument is fundamentally correct.
As far as I can tell, belief in these premises was not only common in LessWrong at that time, but it was essentially the officially endorsed position of Eliezer Yudkowsky and SIAI. Therefore, we can deduce that EY should have believed that Roko’s argument was correct.
But EY claims that he didn’t believe that Roko’s argument was correct. So the question is: is EY lying?
His behavior was certainly consistent with him believing Roko’s argument. If he wanted to prevent the diffusion of that argument, then even lying about its correctness seems consistent.
So, is he lying? If he is not lying, then why didn’t he believe Roko’s argument? As far as I know, he never provided a refutation.
This was addressed on the LessWrongWiki page; I didn’t copy the full article here.
A few reasons Roko’s argument doesn’t work:
1 - Logical decision theories are supposed to one-box on Newcomb’s problem because it’s globally optimal even though it’s not optimal with respect to causally downstream events. A decision theory based on this idea could follow through on blackmail threats even when doing so isn’t causally optimal, which appears to put past agents at risk of coercion by future agents. But such a decision theory also prescribes ‘don’t be the kind of agent that enters into trades that aren’t globally optimal, even if the trade is optimal with respect to causally downstream events’. In other words, if you can bind yourself to precommitments to follow through on acausal blackmail, then it should also be possible to bind yourself to precommitments to ignore threats of blackmail.
The ‘should’ here is normative: there are probably some decision theories that let agents acausally blackmail each other, but others that perform well in Newcomb’s problem and the smoking lesion problem but can’t acausally blackmail each other; it hasn’t been formally demonstrated which theories fall into which category.
2 - Assuming you for some reason are following a decision theory that does put you at risk of acausal blackmail: Since the hypothetical agent is superintelligent, it has lots of ways to trick people into thinking it’s going to torture people without actually torturing them. Since this is cheaper, it would rather do that. And since we’re aware of this, we know any threat of blackmail would be empty. This means that we can’t be blackmailed in practice.
3 - A stronger version of 2 is that rational agents actually have an incentive to harshly punish attempts at blackmail in order to discourage it. So threatening blackmail can actually decrease an agent’s probability of being created, all else being equal.
4 - Insofar as it’s “utilitarian” to horribly punish anyone who doesn’t perfectly promote human flourishing, SIAI doesn’t seem to have endorsed utilitarianism.
4 means that the argument lacks practical relevance. The idea of CEV doesn’t build in very much moral philosophy, and it doesn’t build in predictions about the specific dilemmas future agents might end up in.
Assuming you for some reason are following a decision theory that does put you at risk of acausal blackmail: Since the hypothetical agent is superintelligent, it has lots of ways to trick people into thinking it’s going to torture people without actually torturing them. Since this is cheaper, it would rather do that. And since we’re aware of this, we know any threat of blackmail would be empty.
Um, your conclusion “since we’re aware of this, we know any threat of blackmail would be empty” contradicts your premise that the AI by virtue of being super-intelligent is capable of fooling people into thinking it’ll torture them.
One way of putting this is that the AI, once it exists, can convincingly trick people into thinking it will cooperate in Prisoner’s Dilemmas; but since we know it has this property and we know it prefers (D,C) over (C,C), we know it will defect. This is consistent because we’re assuming the actual AI is powerful enough to trick people once it exists; this doesn’t require the assumption that my low-fidelity mental model of the AI is powerful enough to trick me in the real world.
For acausal blackmail to work, the blackmailer needs a mechanism for convincing the blackmailee that it will follow through on its threat. ‘I’m a TDT agent’ isn’t a sufficient mechanism, because a TDT agent’s favorite option is still to trick other agents into cooperating in Prisoner’s Dilemmas while they defect.
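To make the non-credibility point concrete, here is a minimal sketch (the payoff numbers are mine, purely illustrative, not from the original discussion): once the blackmailee’s response is held fixed, a blackmailer that simply maximizes its payoff always does at least as well by skipping the costly punishment, so a bare ‘I’m a TDT agent’ claim doesn’t make the threat credible.

    # Illustrative payoffs for the blackmailer (hypothetical numbers).
    # The blackmailee either gives in or refuses; the blackmailer either
    # punishes (costly) or doesn't. Punishing never raises the blackmailer's
    # own payoff once the blackmailee's choice is fixed.
    PUNISH_COST = 1
    DEMAND_VALUE = 10

    def blackmailer_payoff(blackmailee, blackmailer):
        value = DEMAND_VALUE if blackmailee == "give_in" else 0
        cost = PUNISH_COST if blackmailer == "punish" else 0
        return value - cost

    for blackmailee in ("give_in", "refuse"):
        for blackmailer in ("punish", "dont_punish"):
            print(blackmailee, blackmailer, blackmailer_payoff(blackmailee, blackmailer))
    # For either fixed response, "dont_punish" scores higher, so the blackmailee
    # can expect the threat to be empty unless something binds the blackmailer
    # to follow through.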
1 - Humans can’t reliably precommit. Even if they could, precommitment is different from using an “acausal” decision theory. You don’t need precommitment to one-box in Newcomb’s problem, and the ability to precommit doesn’t guarantee by itself that you will one-box. In an adversarial game where the players can precommit and use a causal version of game theory, the one that can precommit first generally wins. E.g. Alice can precommit to ignore Bob’s threats, but she has no incentive to do so if Bob already precommitted to ignore Alice’s precommitments, and so on. If you allow for “acausal” reasoning, then even having a time advantage doesn’t work: if Bob isn’t born yet, but Alice predicts that she will be in an adversarial game with Bob and Bob will reason acausally and therefore he will have an incentive to threaten her and ignore her precommitments, then she has an incentive not to make such a precommitment.
2 - This implies that the future AI uses a decision theory that two-boxes in Newcomb’s problem, contradicting the premise that it one-boxes.
3 - This implies that the future AI will have a deontological rule that says “Don’t blackmail” somehow hard-coded in it, contradicting the premise that it will be a utilitarian. Indeed, humans may want to build an AI with such constraints, but in order to do so they will have to consider the possibility of blackmail and likely reject utilitarianism, which was the point of Roko’s argument.
Humans don’t follow any decision theory consistently. They sometimes give in to blackmail, and at other times resist blackmail. If you convinced a bunch of people to take acausal blackmail seriously, presumably some subset would give in and some subset would resist, since that’s what we see in ordinary blackmail situations. What would be interesting is if (a) there were some applicable reasoning norm that forced us to give in to acausal blackmail on pain of irrationality, or (b) there were some known human irrationality that made us inevitably susceptible to acausal blackmail. But I don’t think Roko gave a good argument for either of those claims.
From my last comment: “there are probably some decision theories that let agents acausally blackmail each other”. But if humans frequently make use of heuristics like ‘punish blackmailers’ and ‘never give in to blackmailers’, and if normative decision theory says they’re right to do so, there’s less practical import to ‘blackmailable agents are possible’.
This implies that the future AI uses a decision theory that two-boxes in Newcomb’s problem, contradicting the premise that it one-boxes.
No it doesn’t. If you model Newcomb’s problem as a Prisoner’s Dilemma, then one-boxing maps on to cooperating and two-boxing maps on to defecting. For Omega, cooperating means ‘I put money in both boxes’ and defecting means ‘I put money in just one box’. TDT recognizes that the only two options are mutual cooperation or mutual defection, so TDT cooperates.
Blackmail works analogously. Perhaps the blackmailer has five demands. For the blackmailee, full cooperation means ‘giving in to all five demands’; full defection means ‘rejecting all five demands’; and there are also intermediary levels (e.g., giving in to two demands while rejecting the other three), with the blackmailee preferring to do as little as possible.
For the blackmailer, full cooperation means ‘expending resources to punish the blackmailee in proportion to how many of my demands went unmet’. Full defection means ‘expending no resources to punish the blackmailee even if some demands aren’t met’. In other words, since harming past agents is costly, a blackmailer’s favorite scenario is always ‘the blackmailee, fearing punishment, gives in to most or all of my demands; but I don’t bother punishing them regardless of how many of my demands they ignored’. We could say that full defection doesn’t even bother to check how many of the demands were met, except insofar as this is useful for other goals.
The blackmailer wants to look as scary as possible (to get the blackmailee to cooperate) and then defect at the last moment anyway (by not following through on the threat), if at all possible. In terms of Newcomb’s problem, this is the same as preferring to trick Omega into thinking you’ll one-box, and then two-boxing anyway. We usually construct Newcomb’s problem in such a way that this is impossible; therefore TDT cooperates. But in the real world mutual cooperation of this sort is difficult to engineer, which makes fully credible acausal blackmail at least as difficult.
This implies that the future AI will have a deontological rule that says “Don’t blackmail” somehow hard-coded in it, contradicting the premise that it will be a utilitarian.
I think you misunderstood point 3. 3 is a follow-up to 2: humans and AI systems alike have incentives to discourage blackmail, which increases the likelihood that blackmail is a self-defeating strategy.
Shut up and multiply.
Eliezer has endorsed the claim “two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one”. This doesn’t tell us how bad the act of blackmail itself is, it doesn’t tell us how faithfully we should implement that idea in autonomous AI systems, and it doesn’t tell us how likely it is that a superintelligent AI would find itself forced into this particular moral dilemma.
Since Eliezer asserts a CEV-based agent wouldn’t blackmail humans, the next step in shoring up Roko’s argument would be to do more to connect the dots from “two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one” to a real-world worry about AI systems actually blackmailing people conditional on claims (a) and (c). ‘I find it scary to think a superintelligent AI might follow the kind of reasoning that can ever privilege torture over dust specks’ is not the same thing as ‘I’m scared a superintelligent AI will actually torture people because this will in fact be the best way to prevent a superastronomically large number of dust specks from ending up in people’s eyes’, so Roko’s particular argument has a high evidential burden.
“I precommit to shop at the store with the lowest price within some large distance, even if the cost of the gas and car depreciation to get to a farther store is greater than the savings I get from its lower price. If I do that, stores will have to compete with distant stores based on price, and thus it is more likely that nearby stores will have lower prices. However, this precommitment would only work if I am actually willing to go to the farther store when it has the lowest price even if I lose money”.
You’ve described the mechanism by which the precommitment happened, not actually disputed whether it happens.
Many “irrational” actions by human beings can be analyzed as precommitment; for instance, wanting to take revenge on people who have hurt you even if the revenge doesn’t get you anything.
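To put rough numbers on the store example (the figures are mine, purely illustrative): the precommitment can lose money on a particular trip and still be worth having, because it changes how nearby stores price.

    # Illustrative numbers for the shopping precommitment (not from the comment above).
    far_savings = 4    # how much cheaper the distant store is on this trip
    travel_cost = 6    # gas plus car depreciation to get there
    print(far_savings - travel_cost)   # -2: honoring the precommitment loses money today

    # But if being known to follow through pushes the nearby store to cut its
    # prices by $3 on, say, 20 other trips, the policy comes out well ahead:
    print(20 * 3 - 2)                  # 58: the policy pays for itself overall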
My point is that, to my knowledge, given the evidence that I have about his beliefs at that time, and his actions, and assuming that I’m not misunderstanding them or Roko’s argument, it seems that there is a significant probability that EY lied about not believing that Roko’s argument was correct.
If a philosophical framework causes you to accept a basilisk, I view that as grounds for rejecting the framework, not for accepting the basilisk. The basilisk therefore poses no danger at all to me: if someone presented me with a valid version, it would merely cause me to reconsider my decision theory or something. As a consequence, I’m in favor of discussing basilisks as much as possible (the opposite of EY’s philosophy).
One of my main problems with LWers is that they swallow too many bullets. Sometimes bullets should be dodged. Sometimes you should apply modus tollens and not modus ponens. The basilisk is so a priori implausible that you should be extremely suspicious of fancy arguments claiming to prove it.
To state it yet another way: to me, the basilisk has the same status as an ontological argument for God. Even if I can’t find the flaw in the argument, I’m confident in rejecting it anyway.
The basilisk is so a priori implausible that you should be extremely suspicious of fancy arguments claiming to prove it.
So are: God, superintelligent AI, universal priors, radical life extension, and any really big idea whatever; as well as the impossibility of each of these.
Plausibility is fine as a screening process for deciding where you’re going to devote your efforts, but terrible as an epistemological tool.
Somehow, blackmail from the future seems less plausible to me than every single one of your examples. Not sure why exactly.
How plausible do you find TDT and related decision theories as normative accounts of decision making, or at least as work towards such accounts? They open whole new realms of situations like Pascal’s Mugging, of which Roko’s Basilisk is one. If you’re going to think in detail about such decision theories, and adopt one as normative, you need to have an answer to these situations.
Once you’ve decided to study something seriously, the plausibility heuristic is no longer available.
I find TDT to be basically bullshit except possibly when it is applied to entities which literally see each others’ code, in which case I’m not sure (I’m not even sure if the concept of “decision” even makes sense in that case).
I’d go so far as to say that anyone who advocates cooperating in a one-shot prisoners’ dilemma simply doesn’t understand the setting. By definition, defecting gives you a better outcome than cooperating. Anyone who claims otherwise is changing the definition of the prisoners’ dilemma.
Defecting gives you a better outcome than cooperating if your decision is uncorrelated with the other players’. Different humans’ decisions aren’t 100% correlated, but they also aren’t 0% correlated, so the rationality of cooperating in the one-shot PD varies situationally for humans.
Part of the reason why humans often cooperate in PD-like scenarios in the real world is probably that there’s uncertainty about how iterated the PD is (and our environment of evolutionary adaptedness had a lot more iterated encounters than once-off encounters). But part of the reason for cooperation is probably also that we’ve evolved to do a very weak and probabilistic version of ‘source code sharing’: we’ve evolved to (sometimes) involuntarily display veridical evidence of our emotions, personality, etc. -- as opposed to being in complete control of the information we give others about our dispositions.
Because they’re at least partly involuntary and at least partly veridical, ‘tells’ give humans a way to trust each other even when there are no bad consequences to betrayal—which means at least some people can trust each other at least some of the time to uphold contracts in the absence of external enforcement mechanisms. See also Newcomblike Problems Are The Norm.
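A minimal sketch of the correlation point above (the payoff numbers and the ‘match probability’ model are my own simplification): cooperating becomes the higher-expected-utility move once the chance that the other player’s choice matches mine is high enough relative to the payoff gaps.

    # Standard PD payoffs with T > R > P > S (illustrative numbers), plus a crude
    # model where the other player makes the same choice as me with probability p
    # and the opposite choice with probability 1 - p.
    T, R, P, S = 5, 3, 1, 0

    def expected_utility(my_move, p):
        if my_move == "C":
            return p * R + (1 - p) * S
        return p * P + (1 - p) * T

    for p in (0.0, 0.5, 0.8, 1.0):
        print(p, expected_utility("C", p), expected_utility("D", p))
    # At p = 0 defection wins (0 vs 5); at p = 1 cooperation wins (3 vs 1); with
    # these numbers the crossover sits at p = (T - S) / ((T - S) + (R - P)) = 5/7.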
Defecting gives you a better outcome than cooperating if your decision is uncorrelated with the other players’. Different humans’ decisions aren’t 100% correlated, but they also aren’t 0% correlated, so the rationality of cooperating in the one-shot PD varies situationally for humans.
You’re confusing correlation with causation. Different players’ decision may be correlated, but they sure as hell aren’t causative of each other (unless they literally see each others’ code, maybe).
But part of the reason for cooperation is probably also that we’ve evolved to do a very weak and probabilistic version of ‘source code sharing’: we’ve evolved to (sometimes) involuntarily display veridical evidence of our emotions, personality, etc. -- as opposed to being in complete control of the information we give others about our dispositions.
Calling this source code sharing, instead of just “signaling for the purposes of a repeated game”, seems counter-productive. Yes, I agree that in a repeated game, the situation is trickier and involves a lot of signaling. The one-shot game is much easier: just always defect. By definition, that’s the best strategy.
You’re confusing correlation with causation. Different players’ decision may be correlated, but they sure as hell aren’t causative of each other (unless they literally see each others’ code, maybe). [...] The one-shot game is much easier: just always defect. By definition, that’s the best strategy.
Imagine you are playing against a clone of yourself. Whatever you do, the clone will do the exact same thing. If you choose to cooperate, he will choose to cooperate. If you choose to defect, he chooses to defect.
The best choice is obviously to cooperate.
So there are situations where cooperating is optimal. Despite there not being any causal influence between the players at all.
I think these kinds of situations are so exceedingly rare and unlikely they aren’t worth worrying about. For all practical purposes, the standard game theory logic is fine. But it’s interesting that they exist. And some people are so interested by that, that they’ve tried to formalize decision theories that can handle these situations. And from there you can possibly get counter-intuitive results like the basilisk.
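For the clone case the arithmetic is short (standard PD payoff numbers, chosen by me for the sketch): because the copy’s move is guaranteed to equal mine, the only outcomes actually on the table are mutual cooperation and mutual defection.

    # Exact copy: the copy's move always equals mine, so the reachable outcomes
    # are (C,C) and (D,D). With the usual payoffs T=5, R=3, P=1, S=0:
    R, P = 3, 1
    print("both cooperate ->", R)   # 3 each
    print("both defect    ->", P)   # 1 each
    # Cooperating wins even though my choice has no causal effect on the copy's.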
If I’m playing my clone, it’s not clear that even saying that I’m making a choice is well-defined. After all, my choice will be what my code dictates it will be. Do I prefer that my code cause me to cooperate? Sure, but only because we stipulated that the other player shares the exact same code; it’s more accurate to say that I prefer my opponent’s code to cause him to cooperate, and it just so happens that his code is the same as mine.
In real life, my code is not the same as my opponent’s, and when I contemplate a decision, I’m only thinking about what I want my code to say. Nothing I do changes what my opponent does; therefore, defecting is correct.
Let me restate once more: the only time I’d ever want to cooperate in a one-shot prisoners’ dilemma was if I thought my decision could affect my opponent’s decision. If the latter is the case, though, then I’m not sure if the game was even a prisoners’ dilemma to begin with; instead it’s some weird variant where the players don’t have the ability to independently make decisions.
If I’m playing my clone, it’s not clear that even saying that I’m making a choice is well-defined. After all, my choice will be what my code dictates it will be. Do I prefer that my code cause me to cooperate? Sure, but only because we stipulated that the other player shares the exact same code; it’s more accurate to say that I prefer my opponent’s code to cause him to cooperate, and it just so happens that his code is the same as mine.
I think you are making this more complicated than it needs to be. You don’t need to worry about your code. All you need to know is that it’s an exact copy of you playing, and that he will make the same decision you do. No matter how hard you think about your “code” or wish he would make a different choice, he will just do the same thing you do.
In real life, my code is not the same as my opponent’s, and when I contemplate a decision, I’m only thinking about what I want my code to say. Nothing I do changes what my opponent does; therefore, defecting is correct.
In real games with real humans, yes, usually. As I said, I don’t think these cases are common enough to worry about. But I’m just saying they exist.
But it is more general than just clones. If you know your opponent isn’t exactly the same as you, but still follows the same decision algorithm in this case, the principle is still valid. If you cooperate, he will cooperate. Because you are both following the same process to come to a decision.
the only time I’d ever want to cooperate in a one-shot prisoners’ dilemma was if I thought my decision could affect my opponent’s decision.
Well there is no causal influence. Your opponent is deterministic. His choice may have already been made and nothing you do will change it. And yet the best decision is still to cooperate.
Well there is no causal influence. Your opponent is deterministic. His choice may have already been made and nothing you do will change it. And yet the best decision is still to cooperate.
If his choice is already made and nothing I do will change it, then by definition my choice is already made and nothing I do will change it. That’s why my “decision” in this setting is not even well-defined—I don’t really have free will if external agents already know what I will do.
Yes. The universe is deterministic. Your actions are completely predictable, in principle. That’s not unique to this thought experiment. That’s true for every thing you do. You still have to make a choice. Cooperate or defect?
Yes. The universe is deterministic. Your actions are completely predictable, in principle. That’s not unique to this thought experiment. That’s true for every thing you do. You still have to make a choice. Cooperate or defect?
Um, what? First of all, the universe is not deterministic—quantum mechanics means there’s inherent randomness. Secondly, as far as we know, it’s consistent with the laws of physics that my actions are fundamentally unpredictable—see here.
Third, if I’m playing against a clone of myself, I don’t think it’s even a valid PD. Can the utility functions ever differ between me and my clone? Whenever my clone gets utility, I get utility, because there’s no physical way to distinguish between us (I have no way of saying which copy “I” am). But if we always have the exact same utility—if his happiness equals my happiness—then constructing a PD game is impossible.
Finally, even if I agree to cooperate against my clone, I claim this says nothing about cooperating versus other people. Against all agents that don’t have access to my code, the correct strategy in a one-shot PD is to defect, but first do/say whatever causes my opponent to cooperate. For example, if I was playing against LWers, I might first rant on about TDT or whatever, agree with my opponent’s philosophy as much as possible, etc., etc., and then defect in the actual game. (Note again that this only applies to one-shot games).
Even if you’re playing against a clone, you can distinguish the copies by where they are in space and so on. You can see which side of the room you are on, so you know which one you are. That means one of you can get utility without the other one getting it.
People don’t actually have the same code, but they have similar code. If the code in some case is similar enough that you can’t personally tell the difference, you should follow the same rule as when you are playing against a clone.
You can see which side of the room you are on, so you know which one you are.
If I can do this, then my clone and I can do different things. In that case, I can’t be guaranteed that if I cooperate, my clone will too (because my decision might have depended on which side of the room I’m on). But I agree that the cloning situation is strange, and that I might cooperate if I’m actually faced with it (though I’m quite sure that I never will).
People don’t actually have the same code, but they have similar code. If the code in some case is similar enough that you can’t personally tell the difference, you should follow the same rule as when you are playing against a clone.
How do you know if people have “similar” code to you? See, I’m anonymous on this forum, but in real life, I might pretend to believe in TDT and pretend to have code that’s “similar” to people around me (whatever that means—code similarity is not well-defined). So you might know me in real life. If so, presumably you’d cooperate if we played a PD, because you’d believe our code is similar. But I will defect (if it’s a one-time game). My strategy seems strictly superior to yours—I always get more utility in one-shot PDs.
I would cooperate with you if I couldn’t distinguish my code from yours, even if there might be minor differences, even in a one-shot case, because the best guess I would have of what you would do is that you would do the same thing that I do.
But since you’re making it clear that your code is quite different, and in a particular way, I would defect against you.
But since you’re making it clear that your code is quite different, and in a particular way, I would defect against you.
You don’t know who I am! I’m anonymous! Whoever you’d cooperate with, I might be that person (remember, in real life I pretend to have a completely different philosophy on this matter). Unless you defect against ALL HUMANS, you risk cooperating when facing me, since you don’t know what my disguise will be.
Oh, yes, me too. I want to engage in one-shot PD games with entirelyuseless (as opposed to other people), because he or she will give me free utility if I sell myself right. I wouldn’t want to play one-shot PDs against myself, in the same way that I wouldn’t want to play chess against Kasparov.
By the way, note that I usually cooperate in repeated PD games, and most real-life PDs are repeated games. In addition, my utility function takes other people into consideration; I would not screw people over for small personal gains, because I care about their happiness. In other words, defecting in one-shot PDs is entirely consistent with being a decent human being.
You’re confusing correlation with causation. Different players’ decision may be correlated, but they sure as hell aren’t causative of each other (unless they literally see each others’ code, maybe).
Causation isn’t necessary. You’re right that correlation isn’t quite sufficient, though!
What’s needed for rational cooperation in the prisoner’s dilemma is a two-way dependency between A and B’s decision-making. That can be because A is causally impacting B, or because B is causally impacting A; but it can also occur when there’s a common cause and neither is causing the other, like when my sister and I have similar genomes even though my sister didn’t create my genome and I didn’t create her genome. Or our decision-making processes can depend on each other because we inhabit the same laws of physics, or because we’re both bound by the same logical/mathematical laws—even if we’re on opposite sides of the universe.
(Dependence can also happen by coincidence, though if it’s completely random I’m not sure how’d you find out about it in order to act upon it!)
The most obvious example of cooperating due to acausal dependence is making two atom-by-atom-identical copies of an agent and putting them in a one-shot prisoner’s dilemma against each other. But two agents whose decision-making is 90% similar instead of 100% identical can cooperate on those grounds too, provided the utility of mutual cooperation is sufficiently large.
For the same reason, a very large utility difference can rationally mandate cooperation even if cooperating only changes the probability of the other agent’s behavior from ’100% probability of defection’ to ‘99% probability of defection’.
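A worked version of that last claim, with numbers of my own choosing: even a one-percentage-point shift in the other agent’s behavior can justify cooperating when the stakes are lopsided enough.

    # Suppose cooperating costs me 1 util, and the other agent cooperating is
    # worth 1000 utils to me. If my cooperating only moves their chance of
    # cooperating from 0% to 1%, cooperation still has higher expected utility.
    cost_of_cooperating = 1
    value_if_they_cooperate = 1000

    eu_defect = 0.00 * value_if_they_cooperate
    eu_cooperate = 0.01 * value_if_they_cooperate - cost_of_cooperating
    print(eu_defect, eu_cooperate)   # 0.0 vs 9.0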
Calling this source code sharing, instead of just “signaling for the purposes of a repeated game”, seems counter-productive.
I disagree! “Code-sharing” risks confusing someone into thinking there’s something magical and privileged about looking at source code. It’s true this is an unusually rich and direct source of information (assuming you understand the code’s implications and are sure what you’re seeing is the real deal), but the difference between that and inferring someone’s embarrassment from a blush is quantitative, not qualitative.
Some sources of information are more reliable and more revealing than others; but the same underlying idea is involved whenever something is evidence about an agent’s future decisions. See: Newcomblike Problems are the Norm
Yes, I agree that in a repeated game, the situation is trickier and involves a lot of signaling. The one-shot game is much easier: just always defect. By definition, that’s the best strategy.
If you and the other player have common knowledge that you reason the same way, then the correct move is to cooperate in the one-shot game. The correct move is to defect when those conditions don’t hold strongly enough, though.
The most obvious example of cooperating due to acausal dependence is making two atom-by-atom-identical copies of an agent and putting them in a one-shot prisoner’s dilemma against each other. But two agents whose decision-making is 90% similar instead of 100% identical can cooperate on those grounds too, provided the utility of mutual cooperation is sufficiently large.
I’m not sure what “90% similar” means. Either I’m capable of making decisions independently from my opponent, or else I’m not. In real life, I am capable of doing so. The clone situation is strange, I admit, but in that case I’m not sure to what extent my “decision” even makes sense as a concept; I’ll clearly decide whatever my code says I’ll decide. As soon as you start assuming copies of my code being out there, I stop being comfortable with assigning me free will at all.
Anyway, none of this applies to real life, not even approximately. In real life, my decision cannot change your decision at all; in real life, nothing can even come close to predicting a decision I make in advance (assuming I put even a little bit of effort into that decision).
If you’re concerned about blushing etc., then you’re just saying the best strategy in a prisoner’s dilemma involves signaling very strongly that you’re trustworthy. I agree that this is correct against most human opponents. But surely you agree that if I can control my microexpressions, it’s best to signal “I will cooperate” while actually defecting, right?
Let me just ask you the following yes or no question: do you agree that my “always defect, but first pretend to be whatever will convince my opponent to cooperate” strategy beats all other strategies for a realistic one-shot prisoners’ dilemma? By one-shot, I mean that people will not have any memory of me defecting against them, so I can suffer no ill effects from retaliation.
I’d go so far as to say that anyone who advocates cooperating in a one-shot prisoners’ dilemma simply doesn’t understand the setting. By definition, defecting gives you a better outcome than cooperating. Anyone who claims otherwise is changing the definition of the prisoners’ dilemma.
I think this is correct. I think the reason to cooperate is not to get the best personal outcome, but because you care about the other person. I think we have evolved to cooperate, or perhaps that should be stated as we have evolved to want to cooperate. We have evolved to value cooperating. Our values come from our genes and our memes, and both are subject to evolution, to natural selection. But we want to cooperate.
So if I am in a prisoner’s dilemma against another human, if I perceive that other human as “one of us,” I will choose cooperation. Essentially, I care about their outcome. But in a one-shot PD defecting is the “better” strategy. The problem is that with genetic and/or memetic evolution of cooperation, we are not playing in a one-shot PD. We are playing with a set of values that developed over many shots.
Of course we don’t always cooperate. But when we do cooperate in one-shot PD’s, it is because, in some sense, there are so darn many one-shot PD’s, especially in the universe of hypotheticals, that we effectively know there is no such thing as a one-shot PD. This should not be too hard to accept around here where people semi-routinely accept simulations of themselves or clones of themselves as somehow just as important as their actual selves. I.e. we don’t even accept the “one-shottedness” of ourselves.
I think the reason to cooperate is not to get the best personal outcome, but because you care about the other person.
If you have 100% identical consequentialist values to all other humans, then that means ‘cooperation’ and ‘defection’ are both impossible for humans (because they can’t be put in PDs). Yet it will still be correct to defect (given that your decision and the other player’s decision don’t strongly depend on each other) if you ever run into an agent that doesn’t share all your values. See The True Prisoner’s Dilemma.
This shows that the iterated dilemma and the dilemma-with-common-knowledge-of-rationality allow cooperation (i.e., giving up on your goal to enable someone else to achieve a goal you genuinely don’t want them to achieve), whereas loving compassion and shared values merely change goal-content. To properly visualize the PD, you need an actual value conflict—e.g., imagine you’re playing against a serial killer in a hostage negotiation. ‘Cooperating’ is just an English-language label; the important thing is the game-theoretic structure, which allows that sometimes ‘cooperating’ looks like letting people die in order to appease a killer’s antisocial goals.
To properly visualize the PD, you need an actual value conflict
I think belief conflicts might work, even if the same values are shared. Suppose you and I are at a control panel for three remotely wired bombs in population centers. Both of us want as many people to live as possible. One bomb will go off in ten seconds unless we disarm it, but the others will stay inert unless activated. I believe that pressing the green button causes all bombs to explode, and pressing the red button defuses the time bomb. You believe the same thing, but with the colors reversed. Both of us would rather that no buttons be pressed than both buttons be pressed, but each of us would prefer that just the defuse button be pressed, and that the other person not mistakenly kill all three groups. (Here, attempting to defuse is ‘defecting’ and not attempting to defuse is ‘cooperating’.)
[Edit]: As written, in terms of lives saved, this doesn’t have the property that (D,D)>(C,D); if I press my button, you are indifferent between pressing your button or not. So it’s not true that D strictly dominates C, but the important part of the structure is preserved, and a minor change could make it so D strictly dominates C.
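Here is the button scenario tabulated under my beliefs (D = press the button I think defuses, C = don’t press; the code below is just my rendering of the setup above): it reproduces the ordering described, including the point that D doesn’t strictly dominate C.

    # Groups saved, evaluated under *my* beliefs: my button defuses the time bomb,
    # your button (which you believe defuses) detonates all three bombs.
    def groups_saved(i_press, you_press):
        if you_press:                 # on my model, your button sets everything off
            return 0
        return 3 if i_press else 2    # my button also saves the time-bomb group

    for me in (True, False):
        for you in (True, False):
            print("D" if me else "C", "D" if you else "C", groups_saved(me, you))
    # (D,C)=3 > (C,C)=2 > (D,D)=0 = (C,D)=0: once you press, my move no longer
    # matters, which is why (D,D) > (C,D) fails here.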
I think belief conflicts might work, even if the same values are shared.
You can solve belief conflicts simply by trading in a prediction market with decision-contingent contracts (a “decision market”). Value conflicts are more general than that.
I think this is misusing the word “general.” Value conflicts are more narrow than the full class of games that have the PD preference ordering. I do agree that value conflicts are harder to resolve than belief conflicts, but that doesn’t make them more general.
If you have 100% identical consequentialist values to all other humans, then that means ‘cooperation’ and ‘defection’ are both impossible for humans (because they can’t be put in PDs). … To properly visualize the PD, you need an actual value conflict
True, but the flip side of this is that efficiency (in Coasian terms) is precisely defined as pursuing 100% identical consequentialist values, where the shared “values” are determined by a weighted sum of each agent’s utility function (and the weights are typically determined by agent endowments).
I think the reason to cooperate is not to get the best personal outcome, but because you care about the other person.
I just want to make it clear that by saying this, you’re changing the setting of the prisoners’ dilemma, so you shouldn’t even call it a prisoners’ dilemma anymore. The prisoners’ dilemma is defined so that you get more utility by defecting; if you say you care about your opponent’s utility enough to cooperate, it means you don’t get more utility by defecting, since cooperation gives you utility. Therefore, all you’re saying is that you can never be in a true prisoners’ dilemma game; you’re NOT saying that in a true PD, it’s correct to cooperate (again, by definition, it isn’t).
The most likely reason people are evolutionarily predisposed to cooperate in real-life PDs is that almost all real-life PDs are repeated games and not one-shot. Repeated prisoners’ dilemmas are completely different beasts, and it can definitely be correct to cooperate in them.
If a philosophical framework causes you to accept a basilisk, I view that as grounds for rejecting the framework, not for accepting the basilisk.
...
To state it yet another way: to me, the basilisk has the same status as an ontological argument for God. Even if I can’t find the flaw in the argument, I’m confident in rejecting it anyway.
Despite the other things I’ve said here, that is my attitude as well. But I recognise that when I take that attitude, I am not solving the problem, only ignoring it. It may be perfectly sensible to ignore a problem, even a serious one (comparative advantage etc.). But dissolving a paradox is not achieved by clinging to one of the conflicting thoughts and ignoring the others. (Bullet-swallowing seems to consist of seizing onto the most novel one.) Eliminating the paradox requires showing where and how the thoughts went wrong.
I agree that resolving paradoxes is an important intellectual exercise, and that I wouldn’t be satisfied with simply ignoring an ontological argument (I’d want to find the flaw). But the best way to find such flaws is to discuss the ideas with others. At no point should one assign such a high probability to ideas like Roko’s basilisk being actually sound that one refuses to discuss them with others.
I think saying “Roko’s arguments [...] weren’t generally accepted by other Less Wrong users” is not giving the whole story. Yes, it is true that essentially nobody accepts Roko’s arguments exactly as presented. But a lot of LW users at least thought something along these lines was plausible. Eliezer thought it was so plausible that he banned discussion of it (instead of saying “obviously, information hazards cannot exist in real life, so there is no danger discussing them”).
In other words, while it is true that LWers didn’t believe Roko’s basilisk, they thought is was plausible instead of ridiculous. When people mock LW or Eliezer for believing in Roko’s Basilisk, they are mistaken, but not completely mistaken—if they simply switched to mocking LW for believing the basilisk is plausible, they would be correct (though the mocking would still be mean, of course).
If you are a programmer and think your code is safe because you see no way things could go wrong, it’s still not good to believe that it isn’t plausible that there’s a security hole in your code.
You rather practice defense in depth and plan for the possibility that things can go wrong somewhere in your code, so you add safety precautions. Even when there isn’t what courts call reasonable doubt a good safety engineer still adds additional safety procautions in security critical code. Eliezer deals with FAI safety. As a result it’s good for him to have mindset of really caring about safety.
German nuclear power station have trainings for their desk workers to teach the desk workers to not cut themselves with paper. That alone seems strange to outsiders but everyone in Germany thinks that it’s very important for nuclear power stations to foster a culture of safety even when that means something going overboard.
Cf. AI Risk and the Security Mindset.
Let’s go with this analogy. The good thing to do is ask a variety of experts for safety evaluations, run the code through a wide variety of tests, etc. The think NOT to do is keep the code a secret while looking for mistakes all by yourself. If you keep your code out of the public domain, it is more likely to have security issues, since it was not scrutinized by the public. Banning discussion is almost never correct, and it’s certainly not a good habit.
No, if you don’t want to use code you don’t give the code to a variety of experts for safety evaluations but you simply don’t run the code. Having a public discussion is like running the code untested on a mission critical system.
What utility do you think is gained by discussing the basilisk?
Strawman. This forum is not a place where things get habitually banned.
An interesting discussion that leads to better understanding of decision theories? Like, the same utility as is gained by any other discussion on LW, pretty much.
Sure, but you’re the one that was going on about the importance of the mindset and culture; since you brought it up in the context of banning discussion, it sounded like you were saying that such censorship was part of a mindset/culture that you approve of.
Not every discussion on LW has the same utility.
You engage in a pattern of simplifying the subject and then complaining that your flawed understanding doesn’t make sense.
LW doesn’t have a culture with habitual banning discussion. Claiming that it has it is wrong.
I’m claiming that particular actions of Eliezer come out of being concerned about safety. I don’t claim that Eliezer engages in habitual banning on LW because of those concerns.
It’s a complete strawman that you are making up.
Just FYI, if you want a productive discussion you should hold back on accusing your opponents of fallacies. Ironically, since I never claimed that you claimed Eliezer engages in habitual banning on LW, your accusation that I made a strawman argument is itself a strawman argument.
Anyway, we’re not getting anywhere, so let’s disengage.
The wiki article talks more about this; I don’t think I can give the whole story in a short, accessible way.
It’s true that LessWrongers endorse ideas like AI catastrophe, Hofstadter’s superrationality, one-boxing in Newcomb’s problem, and various ideas in the neighborhood of utilitarianism; and those ideas are weird and controversial; and some criticism of Roko’s basilisk are proxies for a criticism of one of those views. But in most cases it’s a proxy for a criticism like ‘LW users are panicky about weird obscure ideas in decision theory’ (as in Auerbach’s piece), ‘LWers buy into Pascal’s Wager’, or ‘LWers use Roko’s Basilisk to scare up donations/support’.
So, yes, I think people’s real criticisms aren’t the same as their surface criticisms; but the real criticisms are at least as bad as the surface criticism, even from the perspective of someone who thinks LW users are wrong about AI, decision theory, meta-ethics, etc. For example, someone who thinks LWers are overly panicky about AI and overly fixated on decision theory should still reject Auerbach’s assumption that LWers are irrationally panicky about Newcomb’s Problem or acausal blackmail; the one doesn’t follow from the other.
I’m not sure what your point is here. Would you mind re-phrasing? (I’m pretty sure I understand the history of Roko’s Basilisk, so your explanation can start with that assumption.)
My point was that LWers are irrationally panicky about acausal blackmail: they think Basilisks are plausible enough that they ban all discussion of them!
(Not all LWers, of course.)
If you’re saying ‘LessWrongers think there’s a serious risk they’ll be acausally blackmailed by a rogue AI’, then that seems to be false. That even seems to be false in Eliezer’s case, and Eliezer definitely isn’t ‘LessWrong’. If you’re saying ‘LessWrongers think acausal trade in general is possible,’ then that seems true but I don’t see why that’s ridiculous.
Is there something about acausal trade in general that you’re objecting to, beyond the specific problems with Roko’s argument?
It seems we disagree on this factual issue. Eliezer does think there is a risk of acausal blackmail, or else he wouldn’t have banned discussion of it.
Sorry, I’ll be more concrete; “there’s a serious risk” is really vague wording. What would surprise me greatly is if I heard that Eliezer assigned even a 5% probability to there being a realistic quick fix to Roko’s argument that makes it work on humans. I think a larger reason for the ban was just that Eliezer was angry with Roko for trying to spread what Roko thought was an information hazard, and angry people lash out (even when it doesn’t make a ton of strategic sense).
Probably not a quick fix, but I would definitely say Eliezer gives significant chances (say, 10%) to there being some viable version of the Basilisk, which is why he actively avoids thinking about it.
If Eliezer was just angry at Roko, he would have yelled or banned Roko; instead, he banned all discussion of the subject. That doesn’t even make sense as a “slashing out” reaction against Roko.
It sounds like you have a different model of Eliezer (and of how well-targeted ‘lashing out’ usually is) than I do. But, like I said to V_V above:
The point I was making wasn’t that (2) had zero influence. It was that (2) probably had less influence than (3), and its influence was probably of the ‘small probability of large costs’ variety.
I don’t know enough about this to tell if (2) had more influence than (3) initially. I’m glad you agree that (2) had some influence, at least. That was the main part of my point.
How long did discussion of the Basilisk stay banned? Wasn’t it many years? How do you explain that, unless the influence of (2) was significant?
I believe he thinks that sufficiently clever idiots competing to shoot off their own feet will find some way to do so.
It seems unlikely that they would, if their gun is some philosophical decision theory stuff about blackmail from their future. I don’t expect that gun to ever fire, no matter how many times you click the trigger.
That is not what I said, and I’m also guessing you did not have a grandfather who taught you you gun safety.
Is it?
Assume that:
a) There will be a future AI so powerful to torture people, even posthumously (I think this is quite speculative, but let’s assume it for the sake of the argument).
b) This AI will be have a value system based on some form of utilitarian ethics.
c) This AI will use an “acausal” decision theory (one that one-boxes in Newcomb’s problem).
Under these premises it seems to me that Roko’s argument is fundamentally correct.
As far as I can tell, belief in these premises was not only common in LessWrong at that time, but it was essentially the officially endorsed position of Eliezer Yudkowsky and SIAI. Therefore, we can deduce that EY should have believed that Roko’s argument was correct.
But EY claims that he didn’t believe that Roko’s argument was correct. So the question is: is EY lying?
His behavior was certainly consistent with him believing Roko’s argument. If he wanted to prevent the diffusion of that argument, then even lying about its correctness seems consistent.
So, is he lying? If he is not lying, then why didn’t he believe Roko’s argument? As far as I know, he never provided a refutation.
This was addressed on the LessWrongWiki page; I didn’t copy the full article here.
A few reasons Roko’s argument doesn’t work:
1 - Logical decision theories are supposed to one-box on Newcomb’s problem because it’s globally optimal even though it’s not optimal with respect to causally downstream events. A decision theory based on this idea could follow through on blackmail threats even when doing so isn’t causally optimal, which appears to put past agents at risk of coercion by future agents. But such a decision theory also prescribes ‘don’t be the kind of agent that enters into trades that aren’t globally optimal, even if the trade is optimal with respect to causally downstream events’. In other words, if you can bind yourself to precommitments to follow through on acausal blackmail, then it should also be possible to bind yourself to precommitments to ignore threats of blackmail.
The ‘should’ here is normative: there are probably some decision theories that let agents acausally blackmail each other, and others that perform well in Newcomb’s problem and the smoking lesion problem yet can’t acausally blackmail each other; it hasn’t been formally demonstrated which theories fall into which category.
2 - Assuming you are, for some reason, following a decision theory that does put you at risk of acausal blackmail: since the hypothetical agent is superintelligent, it has lots of ways to trick people into thinking it’s going to torture them without actually torturing them. Since this is cheaper, it would rather do that. And since we’re aware of this, we know any threat of blackmail would be empty. This means that we can’t be blackmailed in practice.
3 - A stronger version of 2 is that rational agents actually have an incentive to harshly punish attempts at blackmail in order to discourage it. So threatening blackmail can actually decrease an agent’s probability of being created, all else being equal.
4 - Insofar as it’s “utilitarian” to horribly punish anyone who doesn’t perfectly promote human flourishing, SIAI doesn’t seem to have endorsed utilitarianism.
4 means that the argument lacks practical relevance. The idea of CEV doesn’t build in very much moral philosophy, and it doesn’t build in predictions about the specific dilemmas future agents might end up in.
Um, your conclusion “since we’re aware of this, we know any threat of blackmail would be empty” contradicts your premise that the AI by virtue of being super-intelligent is capable of fooling people into thinking it’ll torture them.
One way of putting this is that the AI, once it exists, can convincingly trick people into thinking it will cooperate in Prisoner’s Dilemmas; but since we know it has this property and we know it prefers (D,C) over (C,C), we know it will defect. This is consistent because we’re assuming the actual AI is powerful enough to trick people once it exists; this doesn’t require the assumption that my low-fidelity mental model of the AI is powerful enough to trick me in the real world.
For acausal blackmail to work, the blackmailer needs a mechanism for convincing the blackmailee that it will follow through on its threat. ‘I’m a TDT agent’ isn’t a sufficient mechanism, because a TDT agent’s favorite option is still to trick other agents into cooperating in Prisoner’s Dilemmas while they defect.
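To make the “empty threat” point concrete, here is a minimal Python sketch. The payoff numbers are illustrative assumptions I picked, not anything from the thread: the blackmailer values compliance at 10 and pays 3 to actually carry out a punishment.

    # Toy payoffs for the blackmailer; the numbers are illustrative assumptions.
    # The blackmailee moves first ("comply" or "refuse"); the blackmailer then
    # decides whether to carry out the threatened punishment ("punish" or "spare").
    BLACKMAILER_PAYOFF = {
        ("comply", "punish"): 10 - 3,   # got compliance, but paid the punishment cost
        ("comply", "spare"): 10,        # got compliance for free
        ("refuse", "punish"): 0 - 3,    # no compliance, still paid to punish
        ("refuse", "spare"): 0,         # no compliance, no cost
    }

    def blackmailer_best_response(blackmailee_choice):
        """The blackmailer's preferred action once the blackmailee has moved."""
        return max(("punish", "spare"),
                   key=lambda action: BLACKMAILER_PAYOFF[(blackmailee_choice, action)])

    # Whatever the blackmailee does, sparing beats punishing, so a blackmailee
    # who can see these payoffs predicts the threat will never be carried out.
    assert blackmailer_best_response("comply") == "spare"
    assert blackmailer_best_response("refuse") == "spare"

The whole question is whether the blackmailer can credibly bind itself out of that best response, which is exactly the missing mechanism described above.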
Except it needs to convince the people who are around before it exists.
1 - Humans can’t reliably precommit. Even if they could, precommitment is different from using an “acausal” decision theory. You don’t need precommitment to one-box in Newcomb’s problem, and the ability to precommit doesn’t guarantee by itself that you will one-box. In an adversarial game where the players can precommit and use a causal version of game theory, the one who can precommit first generally wins. E.g. Alice can precommit to ignore Bob’s threats, but she has no incentive to do so if Bob already precommitted to ignore Alice’s precommitments, and so on. If you allow for “acausal” reasoning, then even having a time advantage doesn’t work: if Bob isn’t born yet, but Alice predicts that she will be in an adversarial game with Bob, and that Bob will reason acausally and therefore have an incentive to threaten her and ignore her precommitments, then she has an incentive not to make such a precommitment.
2 - This implies that the future AI uses a decision theory that two-boxes in Newcomb’s problem, contradicting the premise that it one-boxes.
3 - This implies that the future AI will have a deontological rule that says “Don’t blackmail” somehow hard-coded in it, contradicting the premise that it will be a utilitarian. Indeed, humans may want to build an AI with such constraints, but in order to do so they will have to consider the possibility of blackmail and likely reject utilitarianism, which was the point of Roko’s argument.
4 - Shut up and multiply.
Humans don’t follow any decision theory consistently. They sometimes give in to blackmail, and at other times resist blackmail. If you convinced a bunch of people to take acausal blackmail seriously, presumably some subset would give in and some subset would resist, since that’s what we see in ordinary blackmail situations. What would be interesting is if (a) there were some applicable reasoning norm that forced us to give in to acausal blackmail on pain of irrationality, or (b) there were some known human irrationality that made us inevitably susceptible to acausal blackmail. But I don’t think Roko gave a good argument for either of those claims.
From my last comment: “there are probably some decision theories that let agents acausally blackmail each other”. But if humans frequently make use of heuristics like ‘punish blackmailers’ and ‘never give in to blackmailers’, and if normative decision theory says they’re right to do so, there’s less practical import to ‘blackmailable agents are possible’.
No it doesn’t. If you model Newcomb’s problem as a Prisoner’s Dilemma, then one-boxing maps on to cooperating and two-boxing maps on to defecting. For Omega, cooperating means ‘I put money in both boxes’ and defecting means ‘I put money in just one box’. TDT recognizes that the only two options are mutual cooperation or mutual defection, so TDT cooperates.
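To spell the mapping out, here is a small Python sketch using the conventional illustrative Newcomb payoffs ($1,000,000 in the opaque box, $1,000 in the transparent one); those dollar figures are just the standard example, not something specified in this thread.

    # Newcomb's problem as a two-move game: the player one-boxes (cooperates) or
    # two-boxes (defects); Omega fills the opaque box (cooperates) or leaves it
    # empty (defects). Dollar amounts are the conventional illustrative ones.
    PLAYER_PAYOFF = {
        ("one-box", "fill"): 1_000_000,
        ("one-box", "empty"): 0,
        ("two-box", "fill"): 1_001_000,
        ("two-box", "empty"): 1_000,
    }

    # The problem stipulates a reliable predictor, so only the diagonal outcomes
    # (mutual cooperation or mutual defection) are actually reachable.
    reachable = [("one-box", "fill"), ("two-box", "empty")]

    best = max(reachable, key=lambda outcome: PLAYER_PAYOFF[outcome])
    assert best == ("one-box", "fill")   # on the diagonal, cooperating wins

Two-boxing only looks better if you treat “Omega filled the box” as fixed independently of your choice, which is exactly the assumption the predictor setup removes.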
Blackmail works analogously. Perhaps the blackmailer has five demands. For the blackmailee, full cooperation means ‘giving in to all five demands’; full defection means ‘rejecting all five demands’; and there are also intermediary levels (e.g., giving in to two demands while rejecting the other three), with the blackmailee preferring to give in to as few as possible.
For the blackmailer, full cooperation means ‘expending resources to punish the blackmailee in proportion to how many of my demands went unmet’. Full defection means ‘expending no resources to punish the blackmailee even if some demands aren’t met’. In other words, since harming past agents is costly, a blackmailer’s favorite scenario is always ‘the blackmailee, fearing punishment, gives in to most or all of my demands; but I don’t bother punishing them regardless of how many of my demands they ignored’. We could say that full defection doesn’t even bother to check how many of the demands were met, except insofar as this is useful for other goals.
The blackmailer wants to look as scary as possible (to get the blackmailee to cooperate) and then defect at the last moment anyway (by not following through on the threat), if at all possible. In terms of Newcomb’s problem, this is the same as preferring to trick Omega into thinking you’ll one-box, and then two-boxing anyway. We usually construct Newcomb’s problem in such a way that this is impossible; therefore TDT cooperates. But in the real world mutual cooperation of this sort is difficult to engineer, which makes fully credible acausal blackmail at least as difficult.
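Here is the five-demand game above as a quick sketch in the same style; the numeric costs and values are arbitrary assumptions chosen only to make the structure visible.

    # Graded blackmail: five demands, the blackmailee may give in to any subset.
    # All numbers below are arbitrary illustrative assumptions.
    N_DEMANDS = 5
    VALUE_PER_MET_DEMAND = 3     # what one met demand is worth to the blackmailer
    PUNISH_COST_PER_UNMET = 1    # cost to the blackmailer of punishing one unmet demand

    def blackmailer_payoff(demands_met, follows_through):
        unmet = N_DEMANDS - demands_met
        punishment_cost = PUNISH_COST_PER_UNMET * unmet if follows_through else 0
        return VALUE_PER_MET_DEMAND * demands_met - punishment_cost

    # At every level of compliance, skipping the punishment pays at least as much
    # as carrying it out, and strictly more whenever any demand was refused. The
    # blackmailer's favorite outcome is "high compliance, no follow-through".
    for met in range(N_DEMANDS + 1):
        assert blackmailer_payoff(met, False) >= blackmailer_payoff(met, True)

Which is the sense in which fully credible acausal blackmail is at least as hard to engineer as genuine mutual cooperation: the blackmailer has to somehow commit itself away from its own best response.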
I think you misunderstood point 3. 3 is a follow-up to 2: humans and AI systems alike have incentives to discourage blackmail, which increases the likelihood that blackmail is a self-defeating strategy.
Eliezer has endorsed the claim “two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one”. This doesn’t tell us how bad the act of blackmail itself is, it doesn’t tell us how faithfully we should implement that idea in autonomous AI systems, and it doesn’t tell us how likely it is that a superintelligent AI would find itself forced into this particular moral dilemma.
Since Eliezer asserts a CEV-based agent wouldn’t blackmail humans, the next step in shoring up Roko’s argument would be to do more to connect the dots from “two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one” to a real-world worry about AI systems actually blackmailing people conditional on claims (a) and (c). ‘I find it scary to think a superintelligent AI might follow the kind of reasoning that can ever privilege torture over dust specks’ is not the same thing as ‘I’m scared a superintelligent AI will actually torture people because this will in fact be the best way to prevent a superastronomically large number of dust specks from ending up in people’s eyes’, so Roko’s particular argument has a high evidential burden.
“I precommit to shop at the store with the lowest price within some large distance, even if the cost of the gas and car depreciation to get to a farther store is greater than the savings I get from its lower price. If I do that, stores will have to compete with distant stores based on price, and thus it is more likely that nearby stores will have lower prices. However, this precommitment would only work if I am actually willing to go to the farther store when it has the lowest price even if I lose money”.
Miraculously, people do reliably act this way.
I doubt it. Reference?
Mostly because they don’t actually notice the cost of gas and car depreciation at the time...
You’ve described the mechanism by which the precommitment happened, not actually disputed whether it happens.
Many “irrational” actions by human beings can be analyzed as precommitment; for instance, wanting to take revenge on people who have hurt you even if the revenge doesn’t get you anything.
Lying is consistent with a lot of behavior. The fact that it is, is no basis to accuse people of lying.
I’m not accusing, I’m asking the question.
My point is that, to my knowledge, given the evidence I have about his beliefs at that time and his actions, and assuming that I’m not misunderstanding them or Roko’s argument, there seems to be a significant probability that EY lied about not believing that Roko’s argument was correct.
He’s almost certainly lying about what he believed back then. I have no idea if he’s lying about his current beliefs.
Why would they be correct? The basilisk is plausible.
If a philosophical framework causes you to accept a basilisk, I view that as grounds for rejecting the framework, not for accepting the basilisk. The basilisk therefore poses no danger at all to me: if someone presented me with a valid version, it would merely cause me to reconsider my decision theory or something. As a consequence, I’m in favor of discussing basilisks as much as possible (the opposite of EY’s philosophy).
One of my main problems with LWers is that they swallow too many bullets. Sometimes bullets should be dodged. Sometimes you should apply modus tollens and not modus ponens. The basilisk is so a priori implausible that you should be extremely suspicious of fancy arguments claiming to prove it.
To state it yet another way: to me, the basilisk has the same status as an ontological argument for God. Even if I can’t find the flaw in the argument, I’m confident in rejecting it anyway.
So are: God, superintelligent AI, universal priors, radical life extension, and any really big idea whatever; as well as the impossibility of each of these.
Plausibility is fine as a screening process for deciding where you’re going to devote your efforts, but terrible as an epistemological tool.
Somehow, blackmail from the future seems less plausible to me than every single one of your examples. Not sure why exactly.
How plausible do you find TDT and related decision theories as normative accounts of decision making, or at least as work towards such accounts? They open whole new realms of situations like Pascal’s Mugging, of which Roko’s Basilisk is one. If you’re going to think in detail about such decision theories, and adopt one as normative, you need to have an answer to these situations.
Once you’ve decided to study something seriously, the plausibility heuristic is no longer available.
I find TDT to be basically bullshit except possibly when it is applied to entities which literally see each others’ code, in which case I’m not sure (I’m not even sure if the concept of “decision” even makes sense in that case).
I’d go so far as to say that anyone who advocates cooperating in a one-shot prisoners’ dilemma simply doesn’t understand the setting. By definition, defecting gives you a better outcome than cooperating. Anyone who claims otherwise is changing the definition of the prisoners’ dilemma.
Defecting gives you a better outcome than cooperating if your decision is uncorrelated with the other players’. Different humans’ decisions aren’t 100% correlated, but they also aren’t 0% correlated, so the rationality of cooperating in the one-shot PD varies situationally for humans.
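One way to see the “situationally” part is to write the expected payoff as a function of how the other player’s move depends on yours. A minimal sketch with the usual illustrative PD payoffs (3 for mutual cooperation, 1 for mutual defection, 5 for the temptation, 0 for the sucker outcome); the conditional probabilities are made up.

    # Standard illustrative PD payoffs for "my" side of the game.
    PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

    def expected_payoff(my_move, p_other_cooperates):
        """My expected payoff given the probability that the other player cooperates."""
        return (p_other_cooperates * PAYOFF[(my_move, "C")]
                + (1 - p_other_cooperates) * PAYOFF[(my_move, "D")])

    # Uncorrelated decisions: the other player cooperates with the same probability
    # whether I cooperate or defect, so defecting is strictly better.
    q = 0.6
    assert expected_payoff("D", q) > expected_payoff("C", q)

    # Perfectly correlated decisions (the clone case): my move fixes theirs,
    # so cooperating is strictly better.
    assert expected_payoff("C", 1.0) > expected_payoff("D", 0.0)

    # Partial correlation: cooperating wins once the conditional gap is large enough.
    assert expected_payoff("C", 0.9) > expected_payoff("D", 0.2)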
Part of the reason why humans often cooperate in PD-like scenarios in the real world is probably that there’s uncertainty about how iterated the PD is (and our environment of evolutionary adaptedness had a lot more iterated encounters than once-off encounters). But part of the reason for cooperation is probably also that we’ve evolved to do a very weak and probabilistic version of ‘source code sharing’: we’ve evolved to (sometimes) involuntarily display veridical evidence of our emotions, personality, etc. -- as opposed to being in complete control of the information we give others about our dispositions.
Because they’re at least partly involuntary and at least partly veridical, ‘tells’ give humans a way to trust each other even when there are no bad consequences to betrayal—which means at least some people can trust each other at least some of the time to uphold contracts in the absence of external enforcement mechanisms. See also Newcomblike Problems Are The Norm.
You’re confusing correlation with causation. Different players’ decision may be correlated, but they sure as hell aren’t causative of each other (unless they literally see each others’ code, maybe).
Calling this source code sharing, instead of just “signaling for the purposes of a repeated game”, seems counter-productive. Yes, I agree that in a repeated game, the situation is trickier and involves a lot of signaling. The one-shot game is much easier: just always defect. By definition, that’s the best strategy.
Imagine you are playing against a clone of yourself. Whatever you do, the clone will do the exact same thing. If you choose to cooperate, he will choose to cooperate. If you choose to defect, he chooses to defect.
The best choice is obviously to cooperate.
So there are situations where cooperating is optimal. Despite there not being any causal influence between the players at all.
I think these kinds of situations are so exceedingly rare and unlikely they aren’t worth worrying about. For all practical purposes, the standard game theory logic is fine. But it’s interesting that they exist. And some people are so interested by that, that they’ve tried to formalize decision theories that can handle these situations. And from there you can possibly get counter-intuitive results like the basilisk.
If I’m playing my clone, it’s not clear that even saying that I’m making a choice is well-defined. After all, my choice will be what my code dictates it will be. Do I prefer that my code cause me to accept? Sure, but only because we stipulated that the other player shares the exact same code; it’s more accurate to say that I prefer my opponent’s code to cause him to defect, and it just so happens that his code is the same as mine.
In real life, my code is not the same as my opponent’s, and when I contemplate a decision, I’m only thinking about what I want my code to say. Nothing I do changes what my opponent does; therefore, defecting is correct.
Let me restate once more: the only time I’d ever want to cooperate in a one-shot prisoners’ dilemma would be if I thought my decision could affect my opponent’s decision. If the latter is the case, though, then I’m not sure the game was even a prisoners’ dilemma to begin with; instead it’s some weird variant where the players don’t have the ability to independently make decisions.
I think you are making this more complicated than it needs to be. You don’t need to worry about your code. All you need to know is that it’s an exact copy of you playing, and that he will make the same decision you do. No matter how hard you think about your “code” or wish he would make a different choice, he will just do the same thing as you.
In real games with real humans, yes, usually. As I said, I don’t think these cases are common enough to worry about. But I’m just saying they exist.
But it is more general than just clones. If you know your opponent isn’t exactly the same as you, but still follows the same decision algorithm in this case, the principle is still valid. If you cooperate, he will cooperate. Because you are both following the same process to come to a decision.
Well there is no causal influence. Your opponent is deterministic. His choice may have already been made and nothing you do will change it. And yet the best decision is still to cooperate.
If his choice is already made and nothing I do will change it, then by definition my choice is already made and nothing I do will change it. That’s why my “decision” in this setting is not even well-defined—I don’t really have free will if external agents already know what I will do.
Yes. The universe is deterministic. Your actions are completely predictable, in principle. That’s not unique to this thought experiment. That’s true for every thing you do. You still have to make a choice. Cooperate or defect?
Um, what? First of all, the universe is not deterministic—quantum mechanics means there’s inherent randomness. Secondly, as far as we know, it’s consistent with the laws of physics that my actions are fundamentally unpredictable—see here.
Third, if I’m playing against a clone of myself, I don’t think it’s even a valid PD. Can the utility functions ever differ between me and my clone? Whenever my clone gets utility, I get utility, because there’s no physical way to distinguish between us (I have no way of saying which copy “I” am). But if we always have the exact same utility—if his happiness equals my happiness—then constructing a PD game is impossible.
Finally, even if I agree to cooperate against my clone, I claim this says nothing about cooperating versus other people. Against all agents that don’t have access to my code, the correct strategy in a one-shot PD is to defect, but first do/say whatever causes my opponent to cooperate. For example, if I was playing against LWers, I might first rant on about TDT or whatever, agree with my opponent’s philosophy as much as possible, etc., etc., and then defect in the actual game. (Note again that this only applies to one-shot games).
Even if you’re playing against a clone, you can distinguish the copies by where they are in space and so on. You can see which side of the room you are on, so you know which one you are. That means one of you can get utility without the other one getting it.
People don’t actually have the same code, but they have similar code. If the code in some case is similar enough that you can’t personally tell the difference, you should follow the same rule as when you are playing against a clone.
If I can do this, then my clone and I can do different things. In that case, I can’t be guaranteed that if I cooperate, my clone will too (because my decision might have depended on which side of the room I’m on). But I agree that the cloning situation is strange, and that I might cooperate if I’m actually faced with it (though I’m quite sure that I never will).
How do you know if people have “similar” code to you? See, I’m anonymous on this forum, but in real life, I might pretend to believe in TDT and pretend to have code that’s “similar” to people around me (whatever that means—code similarity is not well-defined). So you might know me in real life. If so, presumably you’d cooperate if we played a PD, because you’d believe our code is similar. But I will defect (if it’s a one-time game). My strategy seems strictly superior to yours—I always get more utility in one-shot PDs.
I would cooperate with you if I couldn’t distinguish my code from yours, even if there might be minor differences, even in a one-shot case, because the best guess I would have of what you would do is that you would do the same thing that I do.
But since you’re making it clear that your code is quite different, and in a particular way, I would defect against you.
You don’t know who I am! I’m anonymous! Whoever you’d cooperate with, I might be that person (remember, in real life I pretend to have a completely different philosophy on this matter). Unless you defect against ALL HUMANS, you risk cooperating when facing me, since you don’t know what my disguise will be.
I will take that chance into account. Fortunately it is a low one and should hardly be a reason to defect against all humans.
Cool, so in conclusion, if we met in real life and played a one-shot PD, you’d (probably) cooperate and I’d defect. My strategy seems superior.
And yet I somehow find myself more inclined to engage in PD-like interactions with entirelyuseless than with your good self.
Oh, yes, me too. I want to engage in one-shot PD games with entirelyuseless (as opposed to other people), because he or she will give me free utility if I sell myself right. I wouldn’t want to play one-shot PDs against myself, in the same way that I wouldn’t want to play chess against Kasparov.
By the way, note that I usually cooperate in repeated PD games, and most real-life PDs are repeated games. In addition, my utility function takes other people into consideration; I would not screw people over for small personal gains, because I care about their happiness. In other words, defecting in one-shot PDs is entirely consistent with being a decent human being.
Causation isn’t necessary. You’re right that correlation isn’t quite sufficient, though!
What’s needed for rational cooperation in the prisoner’s dilemma is a two-way dependency between A and B’s decision-making. That can be because A is causally impacting B, or because B is causally impacting A; but it can also occur when there’s a common cause and neither is causing the other, like when my sister and I have similar genomes even though my sister didn’t create my genome and I didn’t create her genome. Or our decision-making processes can depend on each other because we inhabit the same laws of physics, or because we’re both bound by the same logical/mathematical laws—even if we’re on opposite sides of the universe.
(Dependence can also happen by coincidence, though if it’s completely random I’m not sure how you’d find out about it in order to act upon it!)
The most obvious example of cooperating due to acausal dependence is making two atom-by-atom-identical copies of an agent and putting them in a one-shot prisoner’s dilemma against each other. But two agents whose decision-making is 90% similar instead of 100% identical can cooperate on those grounds too, provided the utility of mutual cooperation is sufficiently large.
For the same reason, a very large utility difference can rationally mandate cooperation even if cooperating only changes the probability of the other agent’s behavior from ’100% probability of defection’ to ‘99% probability of defection’.
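The same expected-value arithmetic covers the 99%-versus-100% case; the specific stakes below are made-up assumptions, chosen only so the tiny probability shift matters.

    # Made-up stakes: the other agent's cooperation is worth a huge amount to me,
    # my own cooperation is cheap, and cooperating shifts their behavior by 1%.
    VALUE_OF_THEIR_COOPERATION = 10_000_000
    COST_OF_MY_COOPERATION = 100
    P_THEY_COOPERATE_IF_I_COOPERATE = 0.01   # i.e. 99% probability of defection
    P_THEY_COOPERATE_IF_I_DEFECT = 0.00      # i.e. 100% probability of defection

    def my_expected_utility(i_cooperate):
        p = P_THEY_COOPERATE_IF_I_COOPERATE if i_cooperate else P_THEY_COOPERATE_IF_I_DEFECT
        cost = COST_OF_MY_COOPERATION if i_cooperate else 0
        return p * VALUE_OF_THEIR_COOPERATION - cost

    # The 1% shift in the other agent's behavior swamps the cost of cooperating.
    assert my_expected_utility(True) > my_expected_utility(False)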
I disagree! “Code-sharing” risks confusing someone into thinking there’s something magical and privileged about looking at source code. It’s true this is an unusually rich and direct source of information (assuming you understand the code’s implications and are sure what you’re seeing is the real deal), but the difference between that and inferring someone’s embarrassment from a blush is quantitative, not qualitative.
Some sources of information are more reliable and more revealing than others; but the same underlying idea is involved whenever something is evidence about an agent’s future decisions. See: Newcomblike Problems are the Norm
If you and the other player have common knowledge that you reason the same way, then the correct move is to cooperate in the one-shot game. The correct move is to defect when those conditions don’t hold strongly enough, though.
I’m not sure what “90% similar” means. Either I’m capable of making decisions independently from my opponent, or else I’m not. In real life, I am capable of doing so. The clone situation is strange, I admit, but in that case I’m not sure to what extent my “decision” even makes sense as a concept; I’ll clearly decide whatever my code says I’ll decide. As soon as you start assuming copies of my code being out there, I stop being comfortable with assigning me free will at all.
Anyway, none of this applies to real life, not even approximately. In real life, my decision cannot change your decision at all; in real life, nothing can even come close to predicting a decision I make in advance (assuming I put even a little bit of effort into that decision).
If you’re concerned about blushing etc., then you’re just saying the best strategy in a prisoner’s dilemma involves signaling very strongly that you’re trustworthy. I agree that this is correct against most human opponents. But surely you agree that if I can control my microexpressions, it’s best to signal “I will cooperate” while actually defecting, right?
Let me just ask you the following yes or no question: do you agree that my “always defect, but first pretend to be whatever will convince my opponent to cooperate” strategy beats all other strategies for a realistic one-shot prisoners’ dilemma? By one-shot, I mean that people will not have any memory of me defecting against them, so I can suffer no ill effects from retaliation.
I think this is correct. I think the reason to cooperate is not to get the best personal outcome, but because you care about the other person. I think we have evolved to cooperate, or perhaps that should be stated as we have evolved to want to cooperate. We have evolved to value cooperating. Our values come from our genes and our memes, and both are subject to evolution, to natural selection. But we want to cooperate.
So if I am in a prisoner’s dilemma against another human, if I perceive that other human as “one of us,” I will choose cooperation. Essentially, I care about their outcome. But in a one-shot PD defecting is the “better” strategy. The problem is that with genetic and/or memetic evolution of cooperation, we are not playing in a one-shot PD. We are playing with a set of values that developed over many shots.
Of course we don’t always cooperate. But when we do cooperate in one-shot PD’s, it is because, in some sense, there are so darn many one-shot PD’s, especially in the universe of hypotheticals, that we effectively know there is no such thing as a one-shot PD. This should not be too hard to accept around here where people semi-routinely accept simulations of themselves or clones of themselves as somehow just as important as their actual selves. I.e. we don’t even accept the “one-shottedness” of ourselves.
If you have 100% identical consequentialist values to all other humans, then that means ‘cooperation’ and ‘defection’ are both impossible for humans (because they can’t be put in PDs). Yet it will still be correct to defect (given that your decision and the other player’s decision don’t strongly depend on each other) if you ever run into an agent that doesn’t share all your values. See The True Prisoner’s Dilemma.
This shows that the iterated dilemma and the dilemma-with-common-knowledge-of-rationality allow cooperation (i.e., giving up on your goal to enable someone else to achieve a goal you genuinely don’t want them to achieve), whereas loving compassion and shared values merely change goal-content. To properly visualize the PD, you need an actual value conflict—e.g., imagine you’re playing against a serial killer in a hostage negotiation. ‘Cooperating’ is just an English-language label; the important thing is the game-theoretic structure, which allows that sometimes ‘cooperating’ looks like letting people die in order to appease a killer’s antisocial goals.
I think belief conflicts might work, even if the same values are shared. Suppose you and I are at a control panel for three remotely wired bombs in population centers. Both of us want as many people to live as possible. One bomb will go off in ten seconds unless we disarm it, but the others will stay inert unless activated. I believe that pressing the green button causes all bombs to explode, and pressing the red button defuses the time bomb. You believe the same thing, but with the colors reversed. Both of us would rather that no buttons be pressed than both buttons be pressed, but each of us would prefer that just the defuse button be pressed, and that the other person not mistakenly kill all three groups. (Here, attempting to defuse is ‘defecting’ and not attempting to defuse is ‘cooperating’.)
[Edit]: As written, in terms of lives saved, this doesn’t have the property that (D,D)>(C,D); if I press my button, you are indifferent between pressing your button or not. So it’s not true that D strictly dominates C, but the important part of the structure is preserved, and a minor change could make it so D strictly dominates C.
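Scoring the thought experiment from my side of the table, in expected population centers saved according to my own beliefs (red defuses the timer bomb, green sets off all three), the structure looks like the sketch below; the numbers follow from my reading of the setup rather than from anything extra.

    # Outcomes scored in population centers saved, computed from *my* beliefs:
    # red defuses the timer bomb, green detonates all three bombs.
    # "Defecting" = attempting to defuse (pressing the button I believe is safe).
    def centers_saved_by_my_lights(i_press_red, you_press_green):
        if you_press_green:      # by my beliefs, green detonates everything
            return 0
        if i_press_red:          # by my beliefs, red defuses the timer bomb
            return 3
        return 2                 # timer bomb goes off; the other two stay inert

    # From my own beliefs, pressing is never worse and sometimes better...
    assert centers_saved_by_my_lights(True, False) > centers_saved_by_my_lights(False, False)
    assert centers_saved_by_my_lights(True, True) == centers_saved_by_my_lights(False, True)
    # ...and by the mirror-image reasoning the same holds for you, even though each
    # of us scores the no-presses outcome (2 saved) above the both-press outcome (0 saved).
    assert centers_saved_by_my_lights(False, False) > centers_saved_by_my_lights(True, True)

This also makes the caveat in the edit visible: given that the other player presses, pressing only ties rather than wins, so defecting merely weakly dominates cooperating in terms of lives saved.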
You can solve belief conflicts simply by trading in a prediction market with decision-contingent contracts (a “decision market”). Value conflicts are more general than that.
I think this is misusing the word “general.” Value conflicts are more narrow than the full class of games that have the PD preference ordering. I do agree that value conflicts are harder to resolve than belief conflicts, but that doesn’t make them more general.
True, but the flip side of this is that efficiency (in Coasian terms) is precisely defined as pursuing 100% identical consequentialist values, where the shared “values” are determined by a weighted sum of each agent’s utility function (and the weights are typically determined by agent endowments).
I just want to make it clear that by saying this, you’re changing the setting of the prisoners’ dilemma, so you shouldn’t even call it a prisoners’ dilemma anymore. The prisoners’ dilemma is defined so that you get more utility by defecting; if you say you care about your opponent’s utility enough to cooperate, it means you don’t get more utility by defecting, since cooperation gives you utility. Therefore, all you’re saying is that you can never be in a true prisoners’ dilemma game; you’re NOT saying that in a true PD, it’s correct to cooperate (again, by definition, it isn’t).
The most likely reason people are evolutionarily predisposed to cooperate in real-life PDs is that almost all real-life PDs are repeated games and not one-shot. Repeated prisoners’ dilemmas are completely different beasts, and it can definitely be correct to cooperate in them.
...
Despite the other things I’ve said here, that is my attitude as well. But I recognise that when I take that attitude, I am not solving the problem, only ignoring it. It may be perfectly sensible to ignore a problem, even a serious one (comparative advantage etc.). But dissolving a paradox is not achieved by clinging to one of the conflicting thoughts and ignoring the others. (Bullet-swallowing seems to consist of seizing onto the most novel one.) Eliminating the paradox requires showing where and how the thoughts went wrong.
I agree that resolving paradoxes is an important intellectual exercise, and that I wouldn’t be satisfied with simply ignoring an ontological argument (I’d want to find the flaw). But the best way to find such flaws is to discuss the ideas with others. At no point should one assign such a high probability to ideas like Roko’s basilisk being actually sound that one refuses to discuss them with others.
Finding an idea plausible has little to do with being extremely suspicious of fancy arguments claiming to prove it.
Ideas that aren’t proven to be impossible are plausible even when there are no convincing arguments in favor of them.
Ideas that aren’t proven to be impossible are possible. They don’t have to be plausible.