Exploitable? Please explain!
There are a couple of different ways of exploiting an FDT agent. One method is to notice that FDT agents have implicitly precommitted to FDT itself (rather than to the theorist's intended terminal value function). It's therefore possible to contrive scenarios in which those two objectives diverge.
Another method is to modify your own value function such that “make functional decision theorists look stupid” becomes a terminal value. After you do that, you can blackmail them with impunity.
FDT is a reasonable heuristic, but it’s not secure against pathological hostile action.
"Modifying your utility function" is called a threat-by-proxy, and FDT agents ignore such threats, so you are disincentivized from doing this.
Not if you do it anyway.
"Saying you are gonna do it anyway in the hope that the FDT agent yields" and "doing it anyway" are two very different things.
Correct. The last time I was negotiating with a self-described FDT agent I did it anyway. 😛
My utility function is "make functional decision theorists look stupid", which I satisfy by blackmailing them. Either they cave, which means I win, or they don't cave, which demonstrates that FDT is stupid.
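To spell out the lose-lose structure in toy form (the payoff numbers below are invented purely for illustration; nothing hinges on them):

# Toy payoff sketch of the blackmail game. Numbers are made up.

def blackmailer_payoff(fdt_caves: bool) -> int:
    if fdt_caves:
        return 1   # causal win: the blackmail paid off
    return 1       # acausal win: FDT agents visibly ate a loss

def fdt_agent_payoff(fdt_caves: bool) -> int:
    return -1 if fdt_caves else -10   # pay the blackmailer, or eat the punishment

for caves in (True, False):
    print("caves:", caves, "| blackmailer:", blackmailer_payoff(caves),
          "| FDT agent:", fdt_agent_payoff(caves))

Under these made-up numbers the blackmailer is satisfied either way, while the FDT agent loses either way.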
If your original agent is replacing itself as a threat to FDT, because it wants FDT to pay up, then FDT rightly ignores it. Thus the original agent, which just wants paperclips or whatever, has no reason to threaten FDT.
If we postulate a different scenario where your original agent literally terminally values messing over FDT, then FDT would pay up (if FDT actually believes it isn't a threat). Similarly, if your values include turning metal into paperclips and I value metal being anything-but-paperclips, then I/FDT would pay you to avoid turning metal into paperclips. If you had different values, even opposite ones along various axes, then FDT just trades with you.
However, FDT tries to close off the incentive to strategically alter your values, even by proxy, in order to threaten.
So I see this as a non-issue.
I'm not sure I see the pathological case in the problem statement, where an agent has the utility function 'do the worst possible action to agents who exactly implement (Specific Decision Theory)', as a problem either. You can construct such an instance for any decision theory. Do you have a specific idea of how you would get past this? FDT would obviously modify itself if it could use that to get around the detection (and the stakes were important enough to not just eat the cost).
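Roughly, the distinction I'm drawing looks like this (a loose sketch with made-up names, not an actual FDT implementation):

# Loose sketch: how an FDT-ish agent treats a hostile demand, depending on
# whether the hostility was adopted strategically (to extract payment) or is
# a genuine terminal value. Names and structure are illustrative only.

def respond_to_demand(hostility_is_strategic: bool) -> str:
    if hostility_is_strategic:
        # Paying would reward the move of self-modifying into a threat,
        # so the policy is to refuse, which removes the incentive to do it.
        return "ignore"
    # No commitment is being rewarded here; this is just an agent with
    # different (even opposed) values, so ordinary trade is fine.
    return "trade"

print(respond_to_demand(True))    # threat-by-proxy: ignored
print(respond_to_demand(False))   # genuine value conflict: trade / pay up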
My deontological terminal value isn't to causally win. It's for FDT agents to acausally lose. Either I win, or the FDT agents abandon FDT. (Which proves that FDT is an exploitable decision theory.)
There’s a Daoist answer: Don’t legibly and universally precommit to a decision theory.
But the exploit I’m trying to point to is simpler than Daoist decision theory. Here it is: Functional decision theory conflates two decisions:
1. Use FDT.
2. Determine a strategy via FDT.
I’m blackmailing contingent on decision 1 and not on decision 2. I’m not doing this because I need to win. I’m doing it because I can. Because it puts FDT agents in a hilarious lose-lose situation.
The thing FDT disciples don’t understand is that I’m happy to take the scenario where FDT agents don’t cave to blackmail. Because of this, FDT demands that FDT agents cave to my blackmail.
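To be explicit about the two decisions, and which one my blackmail keys on (toy code; all names are mine):

# Decision 1: *be* an FDT agent, i.e. your whole policy is FDT.
# Decision 2: on some particular problem, compute what FDT recommends and do it.
# The blackmail below is contingent on decision 1, not decision 2: it targets
# what your decision procedure *is*, not which action you happen to output.

def is_fdt_agent(agent_source: str) -> bool:
    # Stand-in for "the blackmailer can read or predict your algorithm".
    return "FDT" in agent_source

def blackmail(agent_source: str) -> str:
    if is_fdt_agent(agent_source):
        return "pay up, or I make FDT look stupid"
    return "no blackmail"

print(blackmail("act = lambda p: FDT(p)"))   # targeted
print(blackmail("act = lambda p: CDT(p)"))   # not targeted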
I assume what you’re going for with your conflation of the two decisions is this, though you aren’t entirely clear on what you mean:
Some agent starts with some decision theory (potentially broken in various ways, like bad heuristics or being unable to consider certain impacts), because there's no magical a priori decision algorithm
So the agent is using that DT to decide how to make better decisions that get more of what it wants
CDT would typically modify into Son-of-CDT at this step
The agent is deciding whether it should use FDT.
It is 'good enough' that it can predict that, if it decides to just completely replace itself with FDT, it will get punched by your agent, or will have to pay to avoid being punched.
So it doesn’t completely swap out to FDT, even if it is strictly better in all problems that aren’t dependent on your decision theory
But it can still follow FDT to generate actions it should take, which won’t get it punished by you?
Aside: I’m not sure there’s a strong definite boundary between ‘swapping to FDT’ (your ‘use FDT’) and taking FDT’s outputs to get actions that you should take. Ex: If I keep my original decision loop but it just consistently outputs ‘FDT is best to use’, is that swapping to FDT according to you?
Does if (true) { FDT() } else { CDT() } count as FDT or not?
(Obviously you can construct a class of agents which have different levels that they consider this at, though.)
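For concreteness, the two shapes of agent I have in mind (toy code; the function names are mine):

def fdt_policy(problem):
    # Stand-in for whatever FDT outputs on this problem.
    return "fdt_action"

# Shape A: the agent has literally replaced itself with FDT.
def agent_replaced(problem):
    return fdt_policy(problem)

# Shape B: the agent keeps its original decision loop, but that loop
# consistently concludes "FDT is best to use here".
def agent_wrapper(problem):
    best_theory = "FDT"   # the original loop's (constant) verdict
    if best_theory == "FDT":
        return fdt_policy(problem)
    return "fallback_action"

# Behaviourally identical on every input, which is why I'm not sure the
# 'swapped to FDT' vs 'merely follows FDT' boundary is well-defined.
assert agent_replaced("newcomb") == agent_wrapper("newcomb")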
But you’re whatever agent you are. You are automatically committed to whatever decision theory you implement. I can construct a similar scenario for any DT.
'I value punishing agents that swap themselves to being DecisionTheory.'
Or just 'I value punishing agents that use DecisionTheory.'
Am I misunderstanding what you mean?
How do you avoid legibly being committed to a decision theory, when that’s how you decide to take actions in the first place? Inject a bunch of randomness so others can’t analyze your algorithm? Make your internals absurdly intricate to foil most predictors, and only expose a legible decision making part in certain problems?
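One crude version of the 'inject randomness' option, just to show the shape of what I'm asking about (I'm not claiming this is actually robust against a predictor that can read your source):

import random

def fdt_policy(problem):
    return "fdt_action"

def cdt_policy(problem):
    return "cdt_action"

def semi_illegible_agent(problem):
    # Mostly behave like FDT, but occasionally consult something else, so an
    # outside observer watching behaviour can't cleanly classify this agent
    # as "exactly an FDT agent". A predictor reading the source still can.
    if random.random() < 0.9:
        return fdt_policy(problem)
    return cdt_policy(problem)

print(semi_illegible_agent("newcomb"))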
FDT, I believe, would acquire uncertainty about its algorithm if it expects that to actually be beneficial. It isn’t universally-glomarizing like your class of DaoistDTs, but I shouldn’t commit to being illegible either.
I agree with the argument for not replacing your decision theory wholesale with one that does not actually get you the most utility (according to how your current decision theory makes decisions). However, I still don't see how this exploits FDT.
Choosing FDT loses in the environment against you, so our thinking-agent doesn't choose to swap out to FDT (assuming it doesn't just eat the cost for all those future potential trades). It still takes actions as close to FDT as it can, as far as I can tell.
I can still construct a symmetric agent which goes 'Oh you are keeping around all that algorithmic cruft around shelling out to FDT when you just follow it always? Well I like punishing those kinds of agents.'
If the problem specifies that it is an FDT agent from the start, then yes FDT gets punished by your agent. And, how is that exploitable?
The original agent, before it replaced itself with FDT, shouldn't have done that, given full knowledge of the scenario it faced (only one decision forevermore, against an agent which punishes agents which only implement FDT), but that's just the problem statement?
? That's the easy part. You are just describing an agent that likes messing over FDT, so it benefits you regardless of whether the FDT agent gives in to the blackmail or not.
This encourages agents which are deciding what decision theory to self-modify into (or to make servant agents with) to not use FDT for it, if they expect to get more utility by avoiding that.
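That last point, as a back-of-the-envelope expected-utility comparison (all of these numbers are invented; they only show the direction of the incentive):

# An agent deciding whether to self-modify into FDT, in an environment that
# contains a punisher targeting agents which exactly implement FDT.

p_meet_punisher = 0.1      # assumed chance of running into the FDT-punisher
punishment      = -100.0   # assumed cost of being punished (or paying it off)
fdt_gain        = 5.0      # assumed edge FDT gives over the status-quo DT elsewhere

eu_adopt_fdt    = (1 - p_meet_punisher) * fdt_gain + p_meet_punisher * punishment
eu_keep_current = 0.0      # baseline: keep the current decision theory

print("adopt FDT:", eu_adopt_fdt, "keep current:", eu_keep_current)
# With these numbers, adopting FDT comes out at -5.5, so the agent prefers not
# to self-modify into FDT; that's the sense in which such punishers discourage it.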
I definitely would like to hear the details! (I mean, of that last particular case)