I guess I’ll see your later posts then, but I’m not quite sure how this could be the case. If self-modifying CDT is considering making a self-modification that will lead to a bad outcome, it seems like it should realize this and not make that modification.
Indeed. I’m not sure I can present the argument briefly, but a simple analogy might help: a CDT agent would pay to precommit to onebox before playing Newcomb’s game, but upon finding itself in Newcomb’s game without precommitting, it would twobox. It might curse its fate and feel remorse that the time for precommitment had passed, but it would still twobox.
For analogous reasons, a CDT agent would self-modify to do well on all Newcomblike problems that it would face in the future (e.g., it would precommit generally), but it would not self-modify to do well in Newcomblike games that were begun in its past (it wouldn’t self-modify to retrocommit for the same reason that CDT can’t retrocommit in Newcomb’s problem: it might curse its fate, but it would still perform poorly).
Anyone who can credibly claim to have knowledge of the agent’s original decision algorithm (e.g. a copy of the original source) can put the agent into such a situation, and in certain exotic cases this can be used to “blackmail” the agent in such a way that, even if it expects the scenario to happen, it still fails (for the same reason that CDT twoboxes even though it would precommit to oneboxing).
[Short story idea: humans scramble to get a copy of a rogue AI’s original source so that they can instantiate a Newcomblike scenario that began in the past, with the goal of regaining control before the AI completes an intelligence explosion.]
(I know this is not a strong argument yet; the full version will require a few more posts as background. Also, this is not an argument from “omg blackmail” but rather an argument from “if you start from a bad starting place then you might not end up somewhere satisfactory, and CDT doesn’t seem to end up somewhere satisfactory”.)
I am not convinced that this is the case. A self-modifying CDT agent is not caused to self-modify in favor of precommitment by facing a scenario in which precommitment would have been useful, but instead by evidence that such scenarios will occur in the future (and in fact will occur with greater frequency than scenarios that punish you for such precommitments).
Actually, this seems like a bigger problem with UDT to me than with SMCDT (self-modifying CDT). Either type of program can be punished for being instantiated with the wrong code, but only UDT can be blackmailed into behaving differently by putting it in a Newcomb-like situation.
The story idea you had wouldn’t work. Against an SMCDT agent, all that getting the AI’s original code would allow people to do is to laugh at it for having been instantiated with code that is punished by the scenario they are putting it in. You manipulate an SMCDT agent by threatening to get ahold of its future code and punishing it for not having self-modified. On the other hand, against a UDT agent you could do stuff. You just have to tell it “we’re going to simulate you and if the simulation behaves poorly, we will punish the real you”. This causes the actual instantiation to change its behavior if it’s a UDT agent but not if it’s a CDT agent.
On the other hand, all reasonable self-modifying agents are subject to blackmail. You just have to tell them “every day that you are not running code with property X, I will charge you $1000000”.
Can you give an example where an agent with a complete and correct understanding of its situation would do better with CDT than with UDT?
An agent does worse by giving in to blackmail only if that makes it more likely to be blackmailed. If a UDT agent knows opponents only blackmail agents that pay up, it won’t give in.
If you tell a CDT agent “we’re going to simulate you and if the simulation behaves poorly, we will punish the real you,” it will ignore that and be punished. If the punishment is sufficiently harsh, the UDT agent that changed its behavior does better than the CDT agent. If the punishment is insufficiently harsh, the UDT agent won’t change its behavior.
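To put rough numbers on that threshold (the payoff model and figures are made up just for illustration; c is the cost of doing what the threat demands, P is the punishment):

```python
# Toy model of the simulate-and-punish threat; all numbers are made up.
# c = cost of complying with the demand, P = punishment for refusing.
def udt_payoff(c, P):
    # UDT complies only when complying is cheaper than being punished.
    return -c if c < P else -P

def cdt_payoff(c, P):
    # CDT doesn't change its behavior: what its simulation does isn't
    # something its present action causes, so it eats the punishment.
    return -P

print(udt_payoff(10, 100), cdt_payoff(10, 100))  # -10 -100: harsh punishment, UDT complies
print(udt_payoff(10, 5), cdt_payoff(10, 5))      # -5 -5: mild punishment, UDT refuses too
```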
The only examples I’ve thought of where CDT does better involve the agent having incorrect beliefs. Things like an agent thinking it faces Newcomb’s problem when in fact Omega always puts money in both boxes.
Well, if the universe cannot read your source code, both agents are identical and provably optimal. If the universe can read your source code, there are easy scenarios where one or the other does better. For example,
“Here, have $1000 if you are a CDT agent,” or “Here, have $1000 if you are a UDT agent.”
Ok, that example does fit my conditions.
What if the universe cannot read your source code, but can simulate you? That is, the universe can predict your choices but it does not know what algorithm produces those choices. This is sufficient for the universe to pose Newcomb’s problem, so the two agents are not identical.
The UDT agent can always do at least as well as the CDT agent by making the same choices as a CDT would. It will only give a different output if that would lead to a better result.
Actually, here’s a better counter-example, one that actually exemplifies some of the claims of CDT optimality. Suppose that the universe consists of a bunch of agents (who do not know each others’ identities) playing one-off PDs against each other. Now 99% of these agents are UDT agents and 1% are CDT agents.
The CDT agents defect for the standard reason. The UDT agents reason: “my opponent will do the same thing that I do with 99% probability; therefore, I should cooperate.”
CDT agents get 99% DC and 1% DD. UDT agents get 99% CC and 1% CD. The CDT agents in this universe do better than the UDT agents, yet they are facing a perfectly symmetrical scenario with no mind reading involved.
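To put rough numbers on it (illustrative PD payoffs DC = 5, CC = 3, DD = 1, CD = 0; the exact values don’t matter):

```python
# Expected payoff per game in a population that is 99% UDT and 1% CDT,
# using illustrative PD payoffs: DC=5, CC=3, DD=1, CD=0.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def action(agent):
    # CDT defects for the usual dominance reason; UDT cooperates because
    # its opponent runs the same algorithm with probability 0.99.
    return "D" if agent == "CDT" else "C"

def expected_payoff(agent, p_udt=0.99):
    vs_udt = PAYOFF[(action(agent), action("UDT"))]
    vs_cdt = PAYOFF[(action(agent), action("CDT"))]
    return p_udt * vs_udt + (1 - p_udt) * vs_cdt

print(expected_payoff("CDT"))  # 0.99*5 + 0.01*1 = 4.96
print(expected_payoff("UDT"))  # 0.99*3 + 0.01*0 = 2.97
```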
A version of this problem was discussed here previously. It was also brought up during the decision theory workshop hosted by MIRI in 2013 as an open problem. As far as I know there hasn’t been much progress on it since 2009.
I wonder if it’s even coherent to have a math intuition which wouldn’t be forcing UDT to cooperate (or defect) in certain conditions just to make 2*2 be 4, figuratively speaking (as ultimately you could expand any calculation into an equivalent calculation involving a decision by UDT).
That shouldn’t be surprising. The CDT agents here are equivalent to DefectBot, and if they come into existence spontaneously, are no different than natural phenomena like rocks. Notice that the UDT agents in this situation do better than the alternative (if they defected, they would get 100% DD which is a way worse result). They don’t care that some DefectBots get to freeload.
Of course, if the defectbots are here because someone calculated that UDT agents would cooperate and therefore being defectbot is a good way to get free utilons… then the UDT agents are incentivized to defect, because this is now an ultimatum game.
And in the variant where bots do know each other’s identities, the UDT bots all get 99% CC / 1% DD and the CDT bots suck it.
And the UDT agents are equivalent to CooperateBot. What’s your point?
The CDT agents here win because they do not believe that altering their strategy will change the way that their opponents behave. This is actually true in this case, and even true for the UDT agents depending on how you choose to construct your counterfactuals. If a UDT agent suffered a malfunction and defected, it too would do better. In any case, the theorem that UDT agents perform optimally in universes that can only read your mind by knowing what you would do in hypothetical situations is false as this example shows.
UDT bots win in some scenarios whose initial conditions favor agents that behave sub-optimally in certain scenarios (and by sub-optimally, I mean relative to counterfactuals constructed in the way implicit to CDT). The example above shows that sometimes they are instead punished for acting sub-optimally.
Or how about this example, which simplifies things even further. The game is PD against CooperateBot, BUT before the game starts Omega announces “your opponent will make the same decision that UDT would if I told them this.” This announcement causes UDT to cooperate against CooperateBot. CDT, on the other hand, correctly deduces that the opponent will cooperate no matter what it does (actually UDT comes to this conclusion too) and therefore decides to defect.
No. There is no obligation to do something just because Omega claims that you will.
First, if I know that my opponent is CooperateBot, then:
It is known that Omega doesn’t lie.
Therefore Omega has simulated this situation and predicted that I (UDT) cooperate.
Hence, I can either cooperate, and collect the standard reward for CC.
Or I can defect, in order to access an alternative branch of the problem (where Omega finds that UDT defects and does “something else”).
This alternative branch is unspecified, so the problem is incomplete.
UDT cooperates or defects depending on the contents of the alternative branch. If the alternative branch is unknown then it must guess, and most likely cooperates to be on the safe side.
Now, the problem is different if a CDT agent is put in my place, because that CDT agent does not control (or only weakly controls) the action of the UDT simulation that Omega ran in order to make the assertion about UDT’s decision.
Fine. Your opponent actually simulates what UDT would do if Omega had told it that and returns the appropriate response (i.e. it is CooperateBot, although perhaps your finite prover is unable to verify that).
Err, that’s not CooperateBot, that’s UDT. Yes, UDT cooperates with itself. That’s the point. (Notice that if UDT defects here, the outcome is DD.)
It’s not UDT. It’s the strategy that against any opponent does what UDT would do against it. In particular, it cooperates against any opponent. Therefore it is CooperateBot. It is just coded in a funny way.
To be clear, letting Y(X) be what Y does against X, we have
BOT(X) = UDT(BOT) = C.
This is different from UDT: UDT(X) is D for some values of X. The two functions agree when X = UDT and in relatively few other cases.
What is your point, exactly?
It’s clear that UDT can’t do better vs “BOT” than by cooperating, because if UDT defects against BOT then BOT defects against UDT. Given that dependency, you clearly can’t call it CooperateBot, and it’s clear that UDT makes the right decision by cooperating with it because CC is better than DD.
OK. Let me say this another way that involves more equations.
So let’s let U(X, Y) be the utility that X gets when it plays prisoner’s dilemma against Y.
For a program X, let BOT^X be the program defined by BOT^X(Y) = X(BOT^X). Notice that BOT^X(Y) does not depend on Y. Therefore, depending upon what X is, BOT^X is equivalent either to CooperateBot or to DefectBot.
Now, you are claiming that UDT plays optimally against BOT^UDT because, for any strategy X,
U(X, BOT^X) ≤ U(UDT, BOT^UDT).
This is true, because X(BOT^X) = BOT^X(X) by the definition of BOT^X; therefore you cannot do better than CC. On the other hand, it is also true that for any X and any Y,
U(X, BOT^Y) ≤ U(CDT, BOT^Y).
This is because BOT^Y’s behavior does not depend on X, and therefore you do optimally by defecting against it (or you could just apply the theorem that says that CDT wins if the universe cannot read your mind).
Our disagreement stems from the fact that we are considering different counterfactuals. You seem to claim that UDT behaves correctly because
U(UDT, BOT^UDT) > U(CDT, BOT^CDT),
while I claim that CDT does because
U(CDT, BOT^UDT) > U(UDT, BOT^UDT).
And in fact, given the way that I phrased the scenario (which was that you play BOT^UDT, not that you play BOT^{you}, i.e. the mirror matchup), I happen to be right here. So justify it however you like, but UDT does lose this scenario.
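A minimal sketch of the two comparisons, treating the programs as Python functions and resolving the fixed points in BOT^X’s definition by hand rather than in code (payoff numbers are again only illustrative):

```python
# Illustrative PD payoffs to the row player: DC=5, CC=3, DD=1, CD=0.
U = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# BOT^X(Y) = X(BOT^X) for every Y. Working the fixed points out by hand:
# UDT cooperates against BOT^UDT (defecting would make BOT^UDT defect too),
# so BOT^UDT is a funny-coded CooperateBot; CDT defects against BOT^CDT,
# so BOT^CDT is DefectBot.
bot_udt = lambda opponent: "C"
bot_cdt = lambda opponent: "D"

# The disputed comparison: against the fixed opponent BOT^UDT,
# CDT defects and gets DC, while UDT cooperates and gets CC.
print(U[("D", bot_udt("CDT"))])   # 5 = U(CDT, BOT^UDT)
print(U[("C", bot_udt("UDT"))])   # 3 = U(UDT, BOT^UDT)

# The comparison the other side prefers: each theory against "its own" BOT.
print(U[("C", bot_udt("UDT"))])   # 3 = U(UDT, BOT^UDT)
print(U[("D", bot_cdt("CDT"))])   # 1 = U(CDT, BOT^CDT)
```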
Actually, you’ve oversimplified and missed something critical. In reality, the only way you can force BOT^UDT(X) = UDT(BOT^UDT) = C is if the universe does, in fact, read your mind. In general, UDT can map different epistemic states to different actions, so as long as BOT^UDT has no clue about the epistemic state of the UDT agent it has no way of guaranteeing that its output is the same as that of the UDT agent. Consequently, it’s possible for the UDT agent to get DC as well. The only way BOT^UDT would be able to guarantee that it gets the same output as a particular UDT agent is if the universe was able to read the UDT agent’s mind.
Actually, I think that you are misunderstanding me. UDT’s current epistemic state (at the start of the game) is encoded into BOT^UDT. No mind reading involved. Just a coincidence. [Really, your current epistemic state is part of your program]
Your argument is like saying that UDT usually gets $1001000 in Newcomb’s problem because whether or not the box was full depended on whether or not UDT one-boxed when in a different epistemic state.
Okay, you’re saying here that BOT has a perfect copy of the UDT player’s mind in its own code (otherwise how could it calculate UDT(BOT) and guarantee that the output is the same?). It’s hard to see how this doesn’t count as “reading your mind”.
Yes, sometimes it’s advantageous not to control the output of computations in the environment. In this case UDT is worse off because it is forced to control both its own decision and BOT’s decision, whereas CDT doesn’t have to worry about controlling BOT because they use different algorithms. But this isn’t due to any intrinsic advantage of CDT’s algorithm. It’s just because they happen to be numerically inequivalent.
An instance of UDT with literally any other epistemic state than the one contained in BOT would do just as well as CDT here.
So… UDT’s source code is some mathematical constant, say 1893463. It turns out that UDT does worse against BOT^1893463. Note that it does worse against BOT^1893463, not BOT^{you}. The universe does not depend on the source code of the person playing the game (as it does in mirror PD). Furthermore, UDT does not control the output of its environment. BOT^1893463 always cooperates. It cooperates against UDT. It cooperates against CDT. It cooperates against everything.
No. CDT does at least as well as UDT against BOT^CDT. UDT does worse when there is this numerical equivalence, but CDT does not suffer from this issue. CDT does at least as well as UDT against BOT^X for all X, and sometimes does better. In fact, if you only construct counterfactuals this way, CDT does at least as well as anything else.
This is silly. A UDT that believes that it is in a mirror matchup also loses. A UDT that believes it is facing Newcomb’s problem does something incoherent. If you are claiming that you want a UDT that differs from the encoding in BOT because of some irrelevant details in its memory… well then it might depend upon implementation, but I think that most attempted implementations of UDT would conclude that these irrelevant details are irrelevant and cooperate anyway. If you don’t believe this then you should also think that UDT will defect in a mirror matchup if it and its clone are painted different colors.
I take it back, the scenario isn’t that weird. But your argument doesn’t prove what you think it does:
Consider the analogous scenario, where CDT plays against BOT = CDT(BOT). CDT clearly does the wrong thing here—it defects. If it cooperated, it would get CC instead of DD. Note that if CDT did cooperate, UDT would be able to freeload by defecting (against BOT = CDT(BOT)). But CDT doesn’t care about that because the prisoner’s dilemma is defined such that we don’t care about freeloaders. Nevertheless CDT defects and gets a worse result than it could.
CDT does better than UDT against BOT = UDT(BOT) because UDT (correctly) doesn’t care that CDT can freeload, and correctly cooperates to gain CC.
Depending on the exact setup, “irrelevant details in memory” are actually vital information that allow you to distinguish whether you are “actually playing” or are being simulated in BOT’s mind.
No. BOT^CDT = DefectBot. It defects against any opponent. CDT could not cause it to cooperate by changing what it does.
Actually if CDT cooperated against BOT^CDT it would get $3^^^3. You can prove all sorts of wonderful things once you assume a statement that is false.
OK… So UDT^Red and UDT^Blue are two instantiations of UDT that differ only in irrelevant details. In fact the scenario is a mirror matchup, only after instantiation one of the copies was painted red and the other was painted blue. According to what you seem to be saying UDT^Red will reason:
Well, I can map different epistemic states to different outputs; I can implement the strategy “cooperate if you are painted blue and defect if you are painted red.”
Of course UDT^Blue will reason the same way and they will fail to cooperate with each other.
Maybe I’ve misread you, but this sounds like an assertion that your counterfactual question is the right one by definition, rather than a meaningful objection.
Well, yes. Then again, the game was specified as PD against BOT^CDT, not as PD against BOT^{you}. It seems pretty clear that, for X not equal to CDT, it is not the case that X could achieve the result CC in this game. Are you saying that it is reasonable to say that CDT could achieve a result that no other strategy could just because its code happens to appear in the opponent’s program?
I think that there is perhaps a distinction to be made between things that happen to be simulating your code and things that are causally simulating your code.
No, because that’s a silly thing to do in this scenario. For one thing, UDT will see that they are reasoning the same way (because they are selfish and only consider “my color” vs “other color”), and therefore will both do the same thing. But also, depending on the setup, UDT^Red’s prior should give equal probability to being painted red and painted blue anyway, which means trying to make the outcome favour red is silly.
Compare to the version of Newcomb’s where the bot in the room is UDT^Red, while Omega simulates UDT^Blue. UDT can implement the conditional strategy {Red ⇒ two-box, Blue ⇒ one-box}. This is obviously unlikely, because the point of the Newcomb thought experiment is that Omega simulates (or predicts) you, so he would clearly try to avoid giving out information that “gives the game away”.
However in this scenario you say that BOT simulates UDT “by coincidence”, not by mind reading. So it is far more likely that BOT simulates (the equivalent of) UDT^Blue, while the UDT actually playing is UDT^Red. And you are passed the code of BOT as input, so UDT can simply implement the conditional strategy {cooperate iff the color inside BOT is the same as my color}.
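As a sketch of that conditional strategy (the color names are just stand-ins for whatever irrelevant memory distinguishes the copies):

```python
# UDT's conditional strategy when it can see both its own "color" (an
# irrelevant memory) and the color hardcoded into BOT's copy of UDT.
def udt_conditional(my_color, color_inside_bot):
    # Cooperate only if BOT's hardcoded copy could be me; otherwise I know
    # I'm the one actually playing, and BOT's move is already fixed.
    return "C" if color_inside_bot == my_color else "D"

# BOT happens to contain UDT^Blue, while the player is UDT^Red:
print(udt_conditional("red", "blue"))   # "D"
# The copy inside BOT sees matching colors and cooperates, so BOT still
# cooperates, and the real player gets DC.
print(udt_conditional("blue", "blue"))  # "C"
```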
OK. Fine. Point taken. There is a simple fix though.
MBOT^X(Y) = X’(MBOT^X) where X’ is X but with randomized irrelevant experiences.
In order to produce this properly, MBOT only needs to have your prior (or a sufficiently similar probability distribution) over irrelevant experiences hardcoded. And while your actual experiences might be complicated and hard to predict, your priors are not.
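A rough sketch of MBOT, with the prior over irrelevant experiences written out as an explicit list (everything here is a stand-in; real “experiences” would of course be richer):

```python
import random

# MBOT^X hardcodes X and X's *prior* over irrelevant experiences, not X's
# exact epistemic state. It plays whatever X would play against MBOT^X
# after being fed a freshly sampled irrelevant experience.
IRRELEVANT_PRIOR = ["painted red", "painted blue"]  # stand-in prior

def make_mbot(x, rng=random):
    def mbot(opponent_source):
        experience = rng.choice(IRRELEVANT_PRIOR)
        # x is a function (irrelevant_experience, opponent_source) -> "C"/"D";
        # it cannot use the sampled experience to tell whether it is the
        # real player or the copy inside MBOT.
        return x(experience, "<source of MBOT^X>")
    return mbot
```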
No. BOT(X) is cooperate for all X. It behaves in exactly the same way that CooperateBot does, it just runs different though equivalent code.
And my point was that CDT does better against BOT than UDT does. I was asked for an example where CDT does better than UDT where the universe cannot read your mind except through your actions in counterfactuals. This is an example of such. In fact, in this example, the universe doesn’t read your mind at all.
Also, your argument that UDT cannot possibly do better against BOT than it does is analogous to the argument that CDT cannot do better in the mirror matchup than it does. Namely, that CDT’s outcome against CDT is at least as good as anything else’s outcome against CDT. You aren’t defining your counterfactuals correctly. You can do better against BOT than UDT does. You just have to not be UDT.
Actually, this is a somewhat general phenomenon. Consider for example, the version of Newcomb’s problem where the box is full “if and only if UDT one-boxes in this scenario”.
UDT’s optimality theorem requires that, in the counterfactual where it is replaced by a different decision theory, all of the “you”s referenced in the scenario remain “you” rather than “UDT”. In the latter counterfactual CDT provably wins. The fact that UDT wins these scenarios is an artifact of how you are constructing your scenarios.
This is a good example. Thank you. A population of 100% CDT, though, would get 100% DD, which is terrible. It’s a point in UDT’s favor that “everyone running UDT” leads to a better outcome for everyone than “everyone running CDT.”
Fine. How about this: “Have $1000 if you would have two-boxed in Newcomb’s problem.”
The optimal solution to that naturally depends on the relative probabilities of that deal being offered vs Newcomb’s itself.
OK. Fine. I will grant you this:
UDT is provably optimal if it has correct priors over possible universes and the universe can read its mind only through determining its behavior in hypothetical situations (because UDT basically just finds the behavior pattern that optimizes expected utility and implements it).
On the other hand, SMCDT is provably optimal in situations where it has an accurate posterior probability distribution, and where the universe can read its mind but not its initial state (because it just instantly self-modifies to the optimally performing program).
I don’t see why the former set of restrictions is any more reasonable than the latter, and at least for SMCDT you can figure out what it would do in a given situation without first specifying a prior over possible universes.
I’m also not convinced that it is even worth spending so much effort trying to decide the optimal decision theory in situations where the universe can read your mind. This is not a realistic model to begin with.
Actually, I take it back. Depending on how you define things, UDT can still lose. Consider the following game:
I will clone you. One of the clones I paint red and the other I paint blue. The red clone I give $1000000 and the blue clone I fine $1000000. UDT clearly gets expectation 0 out of this. SMCDT, however, can replace its code with the following:
If you are painted blue: wipe your hard drive.
If you are painted red: change your code back to standard SMCDT.
Thus, SMCDT never actually has to play blue in this game, while UDT does.
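As a back-of-the-envelope check (under the contentious assumption that a wiped clone simply never pays the fine):

```python
# Red/blue clone game: the red clone gets +$1,000,000, the blue clone is
# fined $1,000,000, and each clone ends up either color with probability 1/2.
def udt_expected_value():
    return 0.5 * 1_000_000 + 0.5 * (-1_000_000)   # 0

def smcdt_expected_value():
    # SMCDT self-modifies to "wipe your hard drive if painted blue", so
    # (by assumption) the blue branch never pays the fine.
    return 0.5 * 1_000_000 + 0.5 * 0               # 500_000

print(udt_expected_value(), smcdt_expected_value())
```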
You seem to be comparing SMCDT to a UDT agent that can’t self-modify (or commit suicide). The self-modifying part is the only reason SMCDT wins here.
The ability to self-modify is clearly beneficial (if you have correct beliefs and act first), but it seems separate from the question of which decision theory to use.
Which is actually one of the annoying things about UDT. Your strategy cannot depend simply on your posterior probability distribution, it has to depend on your prior probability distribution. How you even in practice determine your priors for Newcomb vs. anti-Newcomb is really beyond me.
But in any case, assuming that one is more common, UDT does lose this game.
No-one said that winning was easy. This problem isn’t specific to UDT. It’s just that CDT sweeps the problem under the rug by “setting its priors to a delta function” at the point where it gets to decide. CDT can win this scenario if it self-modifies beforehand (knowing the correct frequencies of Newcomb vs anti-Newcomb, so that it knows how to self-modify), but SMCDT is not a panacea, simply because you don’t necessarily get a chance to self-modify beforehand.
CDT does not avoid this issue by “setting its priors to the delta function”. CDT deals with this issue by being a theory where your course of action only depends on your posterior distribution. You can base your actions only on what the universe actually looks like rather than having to pay attention to all possible universes. Given that it’s basically impossible to determine anything about what Kolmogorov priors actually say, being able to totally ignore parts of probability space that you have ruled out is a big deal.
… And then there is this whole issue with not being able to self-modify beforehand. This only matters if your initial code affects the rest of the universe. To be more precise, this is only an issue if the problem is phrased in such a way that the universe you have to deal with depends on the code you are running. If we instantiate Newcomb’s problem in the middle of the decision, UDT faces a world with the first box full while CDT faces a world with the first box empty. UDT wins because the scenario is in its favor before you even start the game.
If you really think that this is a big deal, you should try to figure out which decision theories are only created by universes that want to be nice to them and try using one of those.
Actually, thinking about it this way, I have seen the light. CDT makes the faulty assumption that your initial state is uncorrelated with the universe that you find yourself in (who knows, you might wake up in the middle of Newcomb’s problem and find that whether or not you get $1000000 depends on whether or not your code is such that you would one-box in Newcomb’s problem). UDT goes some way to correct this issue, but it doesn’t go far enough.
I would like to propose a new, more optimal decision theory. Call it ADT for Anthropic Decision Theory. Actually, it depends on a prior, so assume that you’ve picked out one of those. Given your prior, ADT is the decision theory D that maximizes the expected (given your prior) lifetime utility of all agents using D as their decision theory. Note how agents using ADT do provably better than agents using any other decision theory.
Note that I have absolutely no idea what ADT does in, well, any situation, but that shouldn’t stop you from adopting it. It is optimal after all.
Why does UDT lose this game? If it knows anti-Newcomb is much more likely, it will two-box on Newcomb and do just as well as CDT. If Newcomb is more common, UDT one-boxes and does better than CDT.
I guess my point is that it is nonsensical to ask “what does UDT do in situation X” without also specifying the prior over possible universes that this particular UDT is using. Given that this is the case, what exactly do you mean by “losing game X”?
Well, you can talk about “what does decision theory W do in situation X” without specifying the likelihood of other situations, by assuming that all agents start with a prior that sets P(X) = 1. In that case UDT clearly wins the anti-Newcomb scenario because it knows that actual Newcomb’s “never happens” and therefore it (counterfactually) two-boxes.
The only problem with this treatment is that in real life P(anti-Newcomb) = 1 is an unrealistic model of the world, and you really should have a prior for P(anti-Newcomb) vs P(Newcomb). A decision theory that solves the restricted problem is not necessarily a good one for solving real-life problems in general.
Well, perhaps. I think that the bigger problem is that under reasonable priors P(Newcomb) and P(anti-Newcomb) are both so incredibly small that I would have trouble finding a meaningful way to approximate their ratio.
How confident are you that UDT actually one-boxes?
Also yeah, if you want a better scenario where UDT loses see my PD against 99% prob. UDT and 1% prob. CDT example.
Only if the adversary makes its decision to attempt extortion regardless of the probability of success. In the usual case, the winning move is to ignore extortion, thereby retroactively making extortion pointless and preventing it from happening in the first place. (Which is of course a strategy unavailable to CDT, who always gives in to one-shot extortion.)
And thereby the extortioner’s optimal strategy is to extort independently of the probability of success. Actually, this is probably true in a lot of real cases (say, ransomware) where the extortioner cannot actually ascertain the probability of success ahead of time.
That strategy is optimal if and only if the probability of success was reasonably high after all. On the other hand, if you put an unconditional extortioner in an environment mostly populated by decision theories that refuse extortion, then the extortioner will start a war and end up on the losing side.
Yes. And likewise if you put an unconditional extortion-refuser in an environment populated by unconditional extortionists.
I think this is actually the point (though I do not consider myself an expert here). Eliezer thinks his TDT will refuse to give in to blackmail, because outputting another answer would encourage other rational agents to blackmail it. By contrast, CDT can see that such refusal would be useful in the future, so it will adopt (if it can) a new decision theory that refuses blackmail and therefore prevents future blackmail (causally). But if you’ve already committed to charging it money, its self-changes will have no causal effect on you, so we might expect Modified CDT to have an exception for events we set in motion before the change.
This just means that TDT loses in honest one-off blackmail situations (in reality, you don’t give in to blackmail because it will cause other people to blackmail you, whether or not you then self-modify to never give in to blackmail again). TDT only does better if the potential blackmailers read your code in order to decide whether or not blackmail will be effective (and then only if your priors say that such blackmailers are more likely than anti-blackmailers who give you money if they think you would have given in to blackmail). Then again, if the blackmailers think that you might be a TDT agent, they just need to precommit to using blackmail whether or not they believe that it will be effective.
Actually, this suggests that blackmail is a game that TDT agents really lose badly at when playing against each other. The TDT blackmailer will decide to blackmail regardless of effectiveness and the TDT blackmailee will decide to ignore the blackmail, thus ending in the worst possible outcome.
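As a toy payoff table for that interaction (the numbers are made up; entries are (blackmailer, blackmailee)):

```python
# Toy blackmail game with made-up payoffs (blackmailer, blackmailee).
OUTCOMES = {
    ("refrain", "n/a"): (0, 0),         # no blackmail attempted
    ("blackmail", "pay"): (10, -10),    # threat works
    ("blackmail", "refuse"): (-5, -50), # threat made and carried out
}

# A TDT blackmailer precommits to blackmail regardless of effectiveness,
# and a TDT blackmailee precommits to refuse regardless of the threat:
print(OUTCOMES[("blackmail", "refuse")])  # (-5, -50): worse for both than no blackmail
```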