If you’re saying ‘LessWrongers think there’s a serious risk they’ll be acausally blackmailed by a rogue AI’, then that seems to be false. That even seems to be false in Eliezer’s case.
Is it?
Assume that:
a) There will be a future AI powerful enough to torture people, even posthumously (I think this is quite speculative, but let’s assume it for the sake of the argument).
b) This AI will have a value system based on some form of utilitarian ethics.
c) This AI will use an “acausal” decision theory (one that one-boxes in Newcomb’s problem).
Under these premises it seems to me that Roko’s argument is fundamentally correct.
As far as I can tell, belief in these premises was not only common on LessWrong at the time, but was essentially the officially endorsed position of Eliezer Yudkowsky and SIAI. Therefore, we can deduce that EY should have believed that Roko’s argument was correct.
But EY claims that he didn’t believe that Roko’s argument was correct. So the question is: is EY lying?
His behavior was certainly consistent with him believing Roko’s argument. If he wanted to prevent that argument from spreading, then even lying about its correctness would be consistent with that goal.
So, is he lying? If he is not lying, then why didn’t he believe Roko’s argument? As far as I know, he never provided a refutation.
This was addressed on the LessWrongWiki page; I didn’t copy the full article here.
A few reasons Roko’s argument doesn’t work:
1 - Logical decision theories are supposed to one-box on Newcomb’s problem because it’s globally optimal even though it’s not optimal with respect to causally downstream events. A decision theory based on this idea could follow through on blackmail threats even when doing so isn’t causally optimal, which appears to put past agents at risk of coercion by future agents. But such a decision theory also prescribes ‘don’t be the kind of agent that enters into trades that aren’t globally optimal, even if the trade is optimal with respect to causally downstream events’. In other words, if you can bind yourself to precommitments to follow through on acausal blackmail, then it should also be possible to bind yourself to precommitments to ignore threats of blackmail. (A numerical sketch of the one-boxing payoff follows this list.)
The ‘should’ here is normative: there are probably some decision theories that let agents acausally blackmail each other, but others that perform well in Newcomb’s problem and the smoking lesion problem but can’t acausally blackmail each other; it hasn’t been formally demonstrated which theories fall into which category.
2 - Assuming you for some reason are following a decision theory that does put you at risk of acausal blackmail: Since the hypothetical agent is superintelligent, it has lots of ways to trick people into thinking it’s going to torture people without actually torturing them. Since this is cheaper, it would rather do that. And since we’re aware of this, we know any threat of blackmail would be empty. This means that we can’t be blackmailed in practice.
3 - A stronger version of 2 is that rational agents actually have an incentive to harshly punish attempts at blackmail in order to discourage it. So threatening blackmail can actually decrease an agent’s probability of being created, all else being equal.
4 - Insofar as it’s “utilitarian” to horribly punish anyone who doesn’t perfectly promote human flourishing, SIAI doesn’t seem to have endorsed utilitarianism.
4 means that the argument lacks practical relevance. The idea of CEV doesn’t build in very much moral philosophy, and it doesn’t build in predictions about the specific dilemmas future agents might end up in.
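To make the ‘globally optimal’ claim in point 1 concrete, here is a minimal numerical sketch using the standard illustrative Newcomb payoffs and a hypothetical predictor accuracy; none of the numbers come from the original discussion:

```python
# Expected payoff of a fixed policy in Newcomb's problem, against a
# predictor that guesses your policy correctly with probability p.
# Box A is transparent and holds $1,000; box B holds $1,000,000 only
# if the predictor expected you to take box B alone.

def expected_value(policy, p=0.99, small=1_000, big=1_000_000):
    if policy == "one-box":
        return p * big                           # predictor right: big prize is there
    else:                                        # "two-box"
        return p * small + (1 - p) * (big + small)

for policy in ("one-box", "two-box"):
    print(policy, expected_value(policy))
# Prints roughly:
#   one-box 990000.0
#   two-box 11000.0
```

An agent whose decision procedure reliably one-boxes does better here, and the same style of reasoning can just as well commit an agent to ignoring blackmail threats, which is the sense in which 1 cuts against Roko’s argument rather than for it.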
Um, your conclusion “since we’re aware of this, we know any threat of blackmail would be empty” contradicts your premise that the AI by virtue of being super-intelligent is capable of fooling people into thinking it’ll torture them.
One way of putting this is that the AI, once it exists, can convincingly trick people into thinking it will cooperate in Prisoner’s Dilemmas; but since we know it has this property and we know it prefers (D,C) over (C,C), we know it will defect. This is consistent because we’re assuming the actual AI is powerful enough to trick people once it exists; this doesn’t require the assumption that my low-fidelity mental model of the AI is powerful enough to trick me in the real world.
For acausal blackmail to work, the blackmailer needs a mechanism for convincing the blackmailee that it will follow through on its threat. ‘I’m a TDT agent’ isn’t a sufficient mechanism, because a TDT agent’s favorite option is still to trick other agents into cooperating in Prisoner’s Dilemmas while they defect.
Except it needs to convince the people who are around before it exists.
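As a minimal sketch of the preference ordering mentioned above, (D,C) over (C,C), here are standard illustrative Prisoner’s Dilemma payoffs (my numbers, not anything from the original discussion), showing why an agent whose move isn’t genuinely linked to the other player’s always prefers to defect at the last moment:

```python
# Illustrative Prisoner's Dilemma payoffs for the row player: T > R > P > S,
# i.e. (D,C) is preferred to (C,C).
payoff = {
    ("C", "C"): 3,  # R: mutual cooperation
    ("C", "D"): 0,  # S: I cooperate, they defect
    ("D", "C"): 5,  # T: I defect, they cooperate
    ("D", "D"): 1,  # P: mutual defection
}

# Once the other player's move is fixed (say they were tricked into
# cooperating), defecting is strictly better either way:
for their_move in ("C", "D"):
    best = max(("C", "D"), key=lambda my_move: payoff[(my_move, their_move)])
    print(f"their move {their_move}: my best reply is {best}")
# their move C: my best reply is D
# their move D: my best reply is D
```

That is the sense in which ‘I’m a TDT agent’ isn’t by itself a credibility mechanism: unless the two moves are genuinely linked, the blackmailer’s best reply is still to defect, i.e. not follow through.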
1 - Humans can’t reliably precommit. Even if they could, precommitment is different from using an “acausal” decision theory. You don’t need precommitment to one-box in Newcomb’s problem, and the ability to precommit doesn’t by itself guarantee that you will one-box. In an adversarial game where the players can precommit and use a causal version of game theory, the one that can precommit first generally wins. E.g. Alice can precommit to ignore Bob’s threats, but she has no incentive to do so if Bob already precommitted to ignore Alice’s precommitments, and so on. If you allow for “acausal” reasoning, then even having a time advantage doesn’t work: if Bob isn’t born yet, but Alice predicts that she will be in an adversarial game with Bob, that Bob will reason acausally, and that he will therefore have an incentive to threaten her and ignore her precommitments, then she has an incentive not to make such a precommitment. (A toy payoff sketch follows this list.)
2 - This implies that the future AI uses a decision theory that two-boxes in Newcomb’s problem, contradicting the premise that it one-boxes.
3 - This implies that the future AI will have a deontological rule that says “Don’t blackmail” somehow hard-coded into it, contradicting the premise that it will be a utilitarian. Indeed, humans may want to build an AI with such constraints, but in order to do so they will have to consider the possibility of blackmail and likely reject utilitarianism, which was the point of Roko’s argument.
4 - Shut up and multiply.
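Here is the toy payoff sketch promised in 1: a one-shot threat game with made-up numbers (nothing here is from the original discussion), just to show why a victim treated as committed to refusing leaves the blackmailer with no incentive to threaten at all, so that the interesting question becomes whose commitment the other side ends up treating as fixed:

```python
# A one-shot "threat game" with illustrative, assumed numbers.
DEMAND_VALUE = 10    # what the blackmailer gains if the victim gives in
DEMAND_COST  = 10    # what giving in costs the victim
PUNISH_COST  = 5     # what carrying out the punishment costs the blackmailer
PUNISH_HARM  = 100   # what being punished costs the victim

def outcome(victim_gives_in, threat_followed_through):
    """Return (victim payoff, blackmailer payoff)."""
    if victim_gives_in:
        return (-DEMAND_COST, DEMAND_VALUE)
    if threat_followed_through:
        return (-PUNISH_HARM, -PUNISH_COST)
    return (0, 0)

# Against a victim committed to refusing, every threat is at best a wash
# and at worst a dead loss for the blackmailer:
for follow_through in (True, False):
    print(follow_through, outcome(False, follow_through))
# True (-100, -5)
# False (0, 0)
```

The regress described in 1 is then just the question of whose commitment the other side treats as binding; the sketch only shows why ‘refuse, whatever the threat’ is a commitment that a blackmailer facing costly punishment has no incentive to test.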
Humans don’t follow any decision theory consistently. They sometimes give in to blackmail, and at other times resist blackmail. If you convinced a bunch of people to take acausal blackmail seriously, presumably some subset would give in and some subset would resist, since that’s what we see in ordinary blackmail situations. What would be interesting is if (a) there were some applicable reasoning norm that forced us to give in to acausal blackmail on pain of irrationality, or (b) there were some known human irrationality that made us inevitably susceptible to acausal blackmail. But I don’t think Roko gave a good argument for either of those claims.
From my last comment: “there are probably some decision theories that let agents acausally blackmail each other”. But if humans frequently make use of heuristics like ‘punish blackmailers’ and ‘never give in to blackmailers’, and if normative decision theory says they’re right to do so, there’s less practical import to ‘blackmailable agents are possible’.
No, it doesn’t imply two-boxing. If you model Newcomb’s problem as a Prisoner’s Dilemma, then one-boxing maps onto cooperating and two-boxing maps onto defecting. For Omega, cooperating means ‘I put money in both boxes’ and defecting means ‘I put money in just one box’. TDT recognizes that the only two options are mutual cooperation or mutual defection, so TDT cooperates.
Blackmail works analogously. Perhaps the blackmailer has five demands. For the blackmailee, full cooperation means ‘giving in to all five demands’; full defection means ‘rejecting all five demands’; and there are also intermediate levels (e.g., giving in to two demands while rejecting the other three), with the blackmailee preferring to do as little as possible.
For the blackmailer, full cooperation means ‘expending resources to punish the blackmailee in proportion to how many of my demands went unmet’. Full defection means ‘expending no resources to punish the blackmailee even if some demands aren’t met’. In other words, since harming past agents is costly, a blackmailer’s favorite scenario is always ‘the blackmailee, fearing punishment, gives in to most or all of my demands; but I don’t bother punishing them regardless of how many of my demands they ignored’. We could say that full defection doesn’t even bother to check how many of the demands were met, except insofar as this is useful for other goals.
The blackmailer wants to look as scary as possible (to get the blackmailee to cooperate) and then defect at the last moment anyway (by not following through on the threat), if at all possible. In terms of Newcomb’s problem, this is the same as preferring to trick Omega into thinking you’ll one-box, and then two-boxing anyway. We usually construct Newcomb’s problem in such a way that this is impossible; therefore TDT cooperates. But in the real world mutual cooperation of this sort is difficult to engineer, which makes fully credible acausal blackmail at least as difficult.
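A minimal sketch of that last point with made-up numbers: once the blackmailee’s level of compliance is fixed, following through on the punishment never helps the blackmailer, which is why the credibility has to come from somewhere other than the blackmailer’s own preferences.

```python
# Illustrative, assumed numbers for the five-demand example above.
N_DEMANDS            = 5
VALUE_PER_DEMAND_MET = 10   # value to the blackmailer of each demand met
COST_PER_PUNISHMENT  = 3    # cost to the blackmailer per unit of punishment

def blackmailer_payoff(demands_met, punish_proportionally):
    gained = VALUE_PER_DEMAND_MET * demands_met
    unmet  = N_DEMANDS - demands_met
    spent  = COST_PER_PUNISHMENT * unmet if punish_proportionally else 0
    return gained - spent

for met in range(N_DEMANDS + 1):
    follow = blackmailer_payoff(met, True)
    ignore = blackmailer_payoff(met, False)
    print(f"{met} demands met: follow through = {follow:4}, don't bother = {ignore:4}")
# At every compliance level "don't bother" does at least as well,
# i.e. full defection weakly dominates for the blackmailer.
```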
I think you misunderstood point 3. 3 is a follow-up to 2: humans and AI systems alike have incentives to discourage blackmail, which increases the likelihood that blackmail is a self-defeating strategy.
As for ‘shut up and multiply’: Eliezer has endorsed the claim “two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one”. This doesn’t tell us how bad the act of blackmail itself is, it doesn’t tell us how faithfully we should implement that idea in autonomous AI systems, and it doesn’t tell us how likely it is that a superintelligent AI would find itself forced into this particular moral dilemma.
Since Eliezer asserts a CEV-based agent wouldn’t blackmail humans, the next step in shoring up Roko’s argument would be to do more to connect the dots from “two independent occurrences of a harm (not to the same person, not interacting with each other) are exactly twice as bad as one” to a real-world worry about AI systems actually blackmailing people conditional on claims (a) and (c). ‘I find it scary to think a superintelligent AI might follow the kind of reasoning that can ever privilege torture over dust specks’ is not the same thing as ‘I’m scared a superintelligent AI will actually torture people because this will in fact be the best way to prevent a superastronomically large number of dust specks from ending up in people’s eyes’, so Roko’s particular argument has a high evidential burden.
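For what it’s worth, the ‘exactly twice as bad’ claim is just linear aggregation of independent harms; a trivial sketch (purely illustrative) shows how little it says about blackmail in particular:

```python
# "Shut up and multiply" as an aggregation rule: disutilities of
# independent harms add linearly. Purely illustrative.
def total_disutility(harms):
    return sum(harms)

print(total_disutility([1.0]))        # one occurrence of a harm
print(total_disutility([1.0, 1.0]))   # two independent occurrences: exactly twice as bad
# The rule says nothing about how bad blackmail itself is, or about whether
# a future agent would ever face the dilemma Roko describes.
```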
“I precommit to shop at the store with the lowest price within some large distance, even if the cost of the gas and car depreciation to get to a farther store is greater than the savings I get from its lower price. If I do that, stores will have to compete with distant stores based on price, and thus it is more likely that nearby stores will have lower prices. However, this precommitment would only work if I am actually willing to go to the farther store when it has the lowest price even if I lose money”.
Miraculously, people do reliably act this way.
I doubt it. Reference?
Mostly because they don’t actually notice the cost of gas and car depreciation at the time...
You’ve described the mechanism by which the precommitment happened, not actually disputed whether it happens.
Many “irrational” actions by human beings can be analyzed as precommitment; for instance, wanting to take revenge on people who have hurt you even if the revenge doesn’t get you anything.
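A back-of-the-envelope version of the store example above, with entirely made-up numbers, just to show where the trade-off sits:

```python
# Does the shopping precommitment pay for itself? Illustrative numbers only.
TRIPS_PER_YEAR        = 50
NEARBY_DISCOUNT       = 2.0   # assumed per-trip price cut nearby stores offer
                              # because the commitment is credible
HONOR_PROBABILITY     = 0.10  # fraction of trips where the far store still wins
NET_LOSS_WHEN_HONORED = 8.0   # extra travel cost minus savings on those trips

gain_from_competition = TRIPS_PER_YEAR * NEARBY_DISCOUNT
cost_of_honoring      = TRIPS_PER_YEAR * HONOR_PROBABILITY * NET_LOSS_WHEN_HONORED

print(f"yearly gain {gain_from_competition}, yearly cost {cost_of_honoring}")
# Prints roughly: yearly gain 100.0, yearly cost 40.0
```

Under these assumptions the commitment is worth holding, but only because the money-losing trips actually get taken often enough for the threat of taking them to stay credible; that is the same structure as the revenge example.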
Lying is consistent with a lot of behavior. That fact, by itself, is no basis for accusing people of lying.
I’m not accusing, I’m asking the question.
My point is that, to my knowledge, given the evidence I have about his beliefs at that time and his actions, and assuming that I’m not misunderstanding them or Roko’s argument, it seems that there is a significant probability that EY lied about not believing that Roko’s argument was correct.
He’s almost certainly lying about what he believed back then. I have no idea if he’s lying about his current beliefs.