Exploiting EDT
The problem with EDT is, as David Lewis put it, its “irrational policy of managing the news” (Lewis, 1981): it chooses actions not only because of their effects on the world, but also because of what the fact that it’s taking these actions tells it about events the agent can’t affect at all. The canonical example is the smoking lesion problem.
I’ve long been uncomfortable with the smoking lesion problem as the case against EDT, because an AI system would know its own utility function, and would therefore know whether or not it values “smoking” (presumably in the AI case it would be a different goal), and if it updates on this fact it would behave correctly in the smoking lesion. (This is an AI-centric version of the “tickle defense” of EDT.) Nate and I have come up with a variant I find much more convincing: a way to get EDT agents to pay you for managing the news for them, which works by the same mechanism that makes these agents one-box in Newcomb’s problem. (It’s a variation of the thought experiment in my LessWrong post on “the sin of updating when you can change whether you exist”.)
Suppose that there’s this EDT agent around which plays the stock market. It’s pretty good at doing so, and has amassed a substantial net worth, but, unsurprisingly, it’s not perfect; there’s always a small chance of its investments going south. Currently, for example, there’s a bit of a hubbub around the CEO of one of the companies the agent has been investing in, and the agent assigns a 0.4% chance that there’s a scandal about to be revealed which will force that CEO to resign; if that’s the case, it expects to lose $150,000,000.
Along comes a clever AI researcher, who is known to be extremely good at predicting how different kinds of AI systems will react on simple thought experiments, who has access to the source code of our agent, and who happens to be able to access information about whether or not there’s a scandal. The researcher could find out and sell the information to the agent, but since it’s a low-probability scenario, the value of that information wouldn’t be that large. Instead, the researcher has a better idea: they are going to
1. find out whether or not there is a scandal;
2. figure out whether or not the agent will pay them $100,000,000 if they ask it to in the next step;
3. if either (a) there is no scandal and the researcher predicts that the agent will pay up, or (b) there is a scandal and they predict that the agent won’t pay up, then they will send the agent a pre-drafted e-mail that explains this whole procedure, explains that either case (a) or (b) turned out to be true (but not which one), and asks the agent to immediately pay the researcher $100,000,000.
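The researcher’s rule for when to send the e-mail can be written down as a small function. This is only a sketch: the name `researcher_sends_email` and the boolean encoding are made up for illustration, and the prediction is treated as a plain input rather than something computed from the agent’s source code.

```python
def researcher_sends_email(scandal: bool, predicted_pay: bool) -> bool:
    """Hypothetical encoding of the researcher's rule for sending the e-mail.

    Case (a): no scandal, and the agent is predicted to pay.
    Case (b): scandal, and the agent is predicted not to pay.
    """
    case_a = (not scandal) and predicted_pay
    case_b = scandal and (not predicted_pay)
    return case_a or case_b


# Enumerate all four combinations to see when the e-mail goes out.
for scandal in (False, True):
    for predicted_pay in (False, True):
        sent = researcher_sends_email(scandal, predicted_pay)
        print(f"scandal={scandal}, predicted_pay={predicted_pay}: send={sent}")
```

The e-mail is sent exactly when “there is a scandal” and “the agent is predicted to pay” have different truth values, which is why receiving it ties the agent’s action so tightly to the scandal.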
We assume that the agent has sufficient information about the researcher to know that they are trustworthy and a good predictor of the agent’s behavior. Then, as an EDT agent, it’ll calculate expected utility by conditioning on its observations and on the action it’s considering taking, as follows:
Conditioning on receiving the e-mail and paying up, the AI is in case (a): it’s paying $100,000,000, but there is no scandal, so it’s not losing the $150,000,000.
Conditioning on receiving the e-mail and not paying up, the AI is in case (b): it’s not paying the researcher, but there’s a scandal, and it’s going to lose $150,000,000.
The first of these leads to higher conditional expected utility, so the AI pays up. The researcher decides to use the money to fund investigations into better decision theories.
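To make the comparison concrete, here is a rough sketch of the calculation using the numbers from the post. Several things are assumptions made for illustration: the researcher is modelled as a perfect predictor, the agent’s prior over its own action is taken to be independent of the scandal (`p_pay_prior`, any value strictly between 0 and 1 gives the same answer), and all function names are hypothetical.

```python
from fractions import Fraction

P_SCANDAL = Fraction(4, 1000)   # the agent's 0.4% credence in a scandal
SCANDAL_LOSS = 150_000_000      # loss if the scandal is real
PAYMENT = 100_000_000           # amount the researcher asks for


def email_sent(scandal: bool, pays: bool) -> bool:
    # Assumption: perfect prediction, so "predicted to pay" equals "pays".
    return scandal != pays


def utility(scandal: bool, pays: bool) -> int:
    return -(SCANDAL_LOSS if scandal else 0) - (PAYMENT if pays else 0)


def edt_expected_utility(pays: bool, p_pay_prior=Fraction(1, 2)) -> Fraction:
    """E[utility | e-mail received, action], computed by Bayesian conditioning."""
    numerator = Fraction(0)
    evidence = Fraction(0)
    for scandal in (False, True):
        p_world = P_SCANDAL if scandal else 1 - P_SCANDAL
        p_action = p_pay_prior if pays else 1 - p_pay_prior
        if email_sent(scandal, pays):
            numerator += p_world * p_action * utility(scandal, pays)
            evidence += p_world * p_action
    return numerator / evidence


def policy_expected_utility(pays_if_asked: bool) -> Fraction:
    """Expected utility of a whole policy, evaluated before any e-mail arrives."""
    total = Fraction(0)
    for scandal in (False, True):
        p_world = P_SCANDAL if scandal else 1 - P_SCANDAL
        asked = email_sent(scandal, pays_if_asked)
        total += p_world * utility(scandal, asked and pays_if_asked)
    return total


print(edt_expected_utility(pays=True))     # -100000000
print(edt_expected_utility(pays=False))    # -150000000
print(policy_expected_utility(True))       # -100200000
print(policy_expected_utility(False))      # -600000
```

Conditional on the e-mail, paying looks better by $50,000,000, so the EDT agent pays. Evaluated as a policy, though, being the kind of agent that pays costs about $100,200,000 in expectation, against $600,000 for being the kind of agent that refuses. Under these assumptions, that gap is the price the agent pays for managing the news.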
Nice example! I think I understood better why this picks out the particular weakness of EDT (and why it’s not a general exploit that can be used against any DT) when I thought of it less as a money-pump and more as “Not only does EDT want to manage the news, you can get it to pay you a lot for the privilege”.
There is a nuance that needs to be mentioned here. If the EDT agent is aware of the researcher’s ploys ahead of time, it will set things up so that emails from the researcher go straight to the spam folder, block the researcher’s calls, and so on. It is not actually happy to pay the researcher for managing the news!
This is less pathological than listening to the researcher and paying up, but it’s still an odd news-management strategy that results from EDT.
True. This looks to me like an effect of EDT not being stable under self-modification, although here the issue is handicapping itself through external means rather than self-modification—like, if you offer a CDT agent a potion that will make it unable to lift more than one box before it enters Newcomb’s problem (i.e., before Omega makes its observation of the agent), then it’ll cheerfully take it and pay you for the privilege.
Thanks! I didn’t really think at all about whether or not “money-pump” was the appropriate word (I’m not sure what the exact definition is); have now changed “way to money-pump EDT agents” into “way to get EDT agents to pay you for managing the news for them”.
Hm, I don’t know what the definition is either. In my head, it means “can get an arbitrary amount of money from”, e.g. by taking it around a preference loop as many times as you like. In any case, glad the feedback was helpful.
I find this surprising, and quite interesting.
Here’s what I’m getting when I try to translate the Tickle Defense:
“If this argument works, the AI should be able to recognize that, and predict the AI researcher’s prediction. It knows that it is already the type of agent that will say yes, effectively screening-off its action from the AI researcher’s prediction. When it conditions on refusing to pay, it still predicts that the AI researcher thought it would pay up, and expects the fiasco with the same probability as ever. Therefore, it refuses to pay. By way of contradiction, we conclude that the original argument doesn’t work.”
This is implausible, since it seems quite likely that conditioning on its “don’t pay up” action causes the AI to consider a universe in which this whole argument doesn’t work (and the AI researcher sent it a letter knowing that it wouldn’t pay, following (b) in the strategy). However, it does highlight the importance of how the EDT agent is computing impossible possible worlds.
More technically, we might assume that the AI is using a good finite-time approximation of one of the logical priors that have been explored, conditioned on the description of the scenario. We include a logical description of its own source code and physical computer [making the agent unable to consider disruptions to its machine, but this isn’t important]. The agent decides by the ambient chicken rule: if the agent can prove what action it will take, it does something different from that. Otherwise, it takes the action with the highest expected utility (according to the Bayesian conditional).
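As a toy illustration of the decision rule just described (not of the logical prior or of any real proof search), the sketch below takes the outcome of the proof search as an input. The function name `chicken_rule_choice`, the `proved_action` argument, and the action labels are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def chicken_rule_choice(
    actions: Sequence[str],
    proved_action: Optional[str],
    expected_utility: Callable[[str], float],
) -> str:
    """Toy version of the chicken rule.

    proved_action stands in for the result of a proof search about the
    agent's own action: None means no proof was found.
    """
    if proved_action is not None:
        # Chicken clause: do something different from the proved action.
        for action in actions:
            if action != proved_action:
                return action
        raise ValueError("no alternative action available")
    # Otherwise maximize conditional expected utility.
    return max(actions, key=expected_utility)


# Conditional expected utilities from the e-mail scenario in the post.
eu = {"pay": -100_000_000, "refuse": -150_000_000}

print(chicken_rule_choice(["pay", "refuse"], None, eu.__getitem__))   # pay
print(chicken_rule_choice(["pay", "refuse"], "pay", eu.__getitem__))  # refuse
```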
Then, the agent cannot predict that it will give the researcher money, because it doesn’t know whether it will trip its chicken clause. However, it knows that the researcher will make a correct prediction. So, it seems that it will pay up.
The tickle defense fails as a result of the chicken rule.
But isn’t triggering a chicken clause impossible without a contradiction?
I like this! You could also post it to Less Wrong without any modifications.
I wonder if this example can be used to help pin down desiderata for decisions or decision counterfactuals. What axiom(s) for decisions would avoid this general class of exploits?