Correct. The last time I was negotiating with a self-described FDT agent I did it anyway. 🙂
My utility function is “make functional decision theorists look stupid”, which I satisfy by blackmailing them. Either they cave, which means I win, or they don’t cave, which demonstrates that FDT is stupid.
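To make the lose-lose claim concrete, here is a minimal sketch with made-up payoff numbers (only the signs matter; nothing here is taken from the actual scenario):

    # Illustrative sketch of the claimed payoff structure, with placeholder numbers.

    # Blackmailer's utility: positive whether or not the FDT agent caves.
    def blackmailer_utility(fdt_caves: bool) -> float:
        if fdt_caves:
            return 1.0   # the FDT agent paid up: "I win"
        return 1.0       # the FDT agent ate the cost: "FDT looks stupid"

    # FDT agent's utility: negative either way once the blackmail is issued.
    def fdt_agent_utility(fdt_caves: bool) -> float:
        if fdt_caves:
            return -1.0  # pays the blackmailer
        return -5.0      # suffers the carried-out threat

    for caves in (True, False):
        print(caves, blackmailer_utility(caves), fdt_agent_utility(caves))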
If your original agent replaces itself with a threat to FDT because it wants FDT to pay up, then FDT rightly ignores the threat. Thus the original agent, which just wants paperclips or whatever, has no reason to threaten FDT.
If we postulate a different scenario where your original agent literally terminally values messing over FDT, then FDT would pay up (if FDT actually believes it isn’t a threat). Similarly, if part of your values has you valuing turning metal into paperclips and I value metal being anything-but-paperclips, I/FDT would pay you to avoid turning metal into paperclips. If you had different values, even opposite ones along various axes, then FDT just trades with you.
However, FDT tries to close off the incentive to strategically alter values, even by proxy, in order to threaten it.
So I see this as a non-issue.
I’m not sure I see the pathological case of the problem statement, an agent with a utility function of “do the worst possible action to agents who exactly implement (Specific Decision Theory)”, as a problem either. You can construct an instance for any decision theory. Do you have a specific idea of how you would get past this? FDT would obviously modify itself if it can use that to get around the detection (and the results are important enough to not just eat the cost).
My deontological terminal value isn’t to causally win. It’s for FDT agents to acausally lose. Either I win, or the FDT agents abandon FDT. (Which proves that FDT is an exploitable decision theory.)
I’m not sure I see the pathological case of the problem statement, an agent with a utility function of “do the worst possible action to agents who exactly implement (Specific Decision Theory)”, as a problem either. Do you have a specific idea of how you would get past this?
There’s a Daoist answer: Don’t legibly and universally precommit to a decision theory.
But the exploit I’m trying to point to is simpler than Daoist decision theory. Here it is: Functional decision theory conflates two decisions:
1. Use FDT.
2. Determine a strategy via FDT.
I’m blackmailing contingent on decision 1 and not on decision 2. I’m not doing this because I need to win. I’m doing it because I can. Because it puts FDT agents in a hilarious lose-lose situation.
The thing FDT disciples don’t understand is that I’m happy to take the scenario where FDT agents don’t cave to blackmail. Because of this, FDT demands that FDT agents cave to my blackmail.
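To spell out the conflation in code form, a hedged sketch (the function names and labels are made up, not anything from the thread): the blackmail predicate reads off which decision theory the target runs, not which strategy that theory outputs.

    # Sketch of the distinction: the blackmail is triggered by *what the target
    # is* (decision 1), not by which strategy FDT outputs (decision 2).

    def blackmail_is_issued(target_decision_theory: str) -> bool:
        # Contingent on decision 1: whether the target uses FDT at all.
        return target_decision_theory == "FDT"

    def fdt_strategy(blackmailed: bool) -> str:
        # Decision 2: the strategy FDT computes. The usual FDT answer is to
        # refuse, since caving is what incentivises threats.
        return "refuse to pay" if blackmailed else "cooperate as normal"

    blackmailed = blackmail_is_issued("FDT")
    print(blackmailed, fdt_strategy(blackmailed))
    # Refusing (decision 2) does not change the trigger (decision 1),
    # so the blackmail gets issued either way.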
I assume what you’re going for with your conflation of the two decisions is this, though you aren’t entirely clear on what you mean:
Some agent starts with some decision theory (potentially broken in various ways, like bad heuristics or an inability to consider certain impacts), because there’s no magical a priori decision algorithm.
So the agent is using that DT to decide how to make better decisions that get more of what it wants.
CDT would typically modify into Son-of-CDT at this step.
The agent is deciding whether it should use FDT.
It is “good enough” that it can predict that, if it decides to just completely replace itself with FDT, it will get punched by your agent, or it will have to pay to avoid being punched.
So it doesn’t completely swap out to FDT, even if FDT is strictly better in all problems that aren’t dependent on your decision theory.
But it can still follow FDT to generate actions it should take, which won’t get it punished by you? (See the sketch below.)
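A rough sketch of the expected-utility comparison described in these steps, under the stated assumption that the punisher only detects a wholesale swap to FDT; all the utility numbers are placeholders:

    # The meta-decision: which self-modification should the original agent pick?
    GAIN_FROM_FDT_DECISIONS = 10.0   # value of making FDT-quality decisions
    PUNISHMENT = 4.0                 # cost imposed by the FDT-punishing agent

    def expected_utility(option: str) -> float:
        if option == "swap_wholesale_to_FDT":
            return GAIN_FROM_FDT_DECISIONS - PUNISHMENT
        if option == "keep_own_loop_but_follow_FDT_outputs":
            return GAIN_FROM_FDT_DECISIONS  # not detected, by assumption
        if option == "stay_with_original_DT":
            return 0.0
        raise ValueError(option)

    options = ["swap_wholesale_to_FDT",
               "keep_own_loop_but_follow_FDT_outputs",
               "stay_with_original_DT"]
    print(max(options, key=expected_utility))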
Aside: I’m not sure there’s a strong, definite boundary between “swapping to FDT” (your “use FDT”) and taking FDT’s outputs to get actions that you should take. Ex: If I keep my original decision loop but it just consistently outputs “FDT is best to use”, is that swapping to FDT according to you?
Does if (true) { FDT() } else { CDT() } count as FDT or not?
(Obviously you can construct a class of agents which consider this question at different levels, though.)
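One way to make the boundary question concrete is to ask what the punishing agent’s detector actually keys on. A hypothetical sketch (the detector functions are purely illustrative): the wrapper and the pure FDT agent behave identically, so only an implementation-keyed detector can tell them apart.

    def FDT(observation):
        return "fdt_action"

    def pure_fdt_agent(observation):
        return FDT(observation)

    def wrapper_agent(observation):
        # The "if (true) { FDT() } else { CDT() }" agent from the discussion.
        if True:
            return FDT(observation)
        else:
            return "cdt_action"

    def detects_by_implementation(agent) -> bool:
        # Only an agent that literally is the pure FDT function counts.
        return agent is pure_fdt_agent

    def detects_by_behaviour(agent) -> bool:
        # Anything that acts like FDT on this observation counts.
        return agent("obs") == FDT("obs")

    for agent in (pure_fdt_agent, wrapper_agent):
        print(agent.__name__,
              detects_by_implementation(agent),
              detects_by_behaviour(agent))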
There’s a Daoist answer: Don’t legibly and universally precommit to a decision theory.
But you’re whatever agent you are. You are automatically committed to whatever decision theory you implement. I can construct a similar scenario for any DT.
“I value punishing agents that swap themselves to being DecisionTheory.”
Or just “I value punishing agents that use DecisionTheory.”
Am I misunderstanding what you mean?
How do you avoid legibly being committed to a decision theory, when that’s how you decide to take actions in the first place? Inject a bunch of randomness so others can’t analyze your algorithm? Make your internals absurdly intricate to foil most predictors, and only expose a legible decision-making part in certain problems?
FDT, I believe, would acquire uncertainty about its algorithm if it expects that to actually be beneficial. It isn’t universally glomarizing like your class of DaoistDTs, but it shouldn’t commit to being illegible either.
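For illustration, a very rough sketch of one of those options: an agent that randomises over decision procedures, so its source does not legibly commit it to a single theory. The mixture weight and names are made up.

    import random

    def fdt_choice(observation):
        return "fdt_action"

    def cdt_choice(observation):
        return "cdt_action"

    def illegible_agent(observation, rng=random.random):
        # The mixture weight could itself be obscured or state-dependent;
        # here it is just a constant for the sketch.
        procedure = fdt_choice if rng() < 0.9 else cdt_choice
        return procedure(observation)

    print(illegible_agent("obs"))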
I agree with the argument for not replacing your decision theory wholesale with one that does not actually get you the most utility (according to how your current decision theory makes decisions). However, I still don’t see how this exploits FDT.
Choosing FDT loses in the environment against you, so our thinking agent doesn’t choose to swap out to FDT (assuming it doesn’t just eat the cost for the sake of all those future potential trades). It still takes actions as close to FDT as it can, as far as I can tell.
I can still construct a symmetric agent which goes “Oh, you are keeping around all that algorithmic cruft of shelling out to FDT when you just always follow it? Well, I like punishing those kinds of agents.”
If the problem specifies that it is an FDT agent from the start, then yes, FDT gets punished by your agent. And how is that exploitable?
The original agent, before it replaced itself with FDT, shouldn’t have done that given full knowledge of the scenario it faced (only one decision forevermore, against an agent which punishes agents that only implement FDT), but that’s just the problem statement?
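To underline that the construction is symmetric, a hedged sketch of a punisher parameterised by the decision theory it targets (names and numbers are illustrative only):

    # For any named decision theory, one can write down a punisher whose utility
    # targets exactly the agents implementing it; nothing is specific to FDT.

    def make_punisher(target_theory: str):
        def punisher_utility(agent_theory: str, agent_outcome: float) -> float:
            if agent_theory == target_theory:
                return -agent_outcome   # rewarded for making that agent lose
            return 0.0
        return punisher_utility

    anti_fdt = make_punisher("FDT")
    anti_cdt = make_punisher("CDT")
    print(anti_fdt("FDT", 10.0), anti_cdt("FDT", 10.0))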
The thing FDT disciples don’t understand is that I’m happy to take the scenario where FDT agents don’t cave to blackmail.
? That’s the easy part. You are just describing an agent that likes messing over FDT, so it benefits you regardless of whether the FDT agent gives in to the blackmail or not.
This encourages agents which are deciding what decision theory to self-modify into (or to build servant agents with) to not use FDT for it, if they expect to get more utility by avoiding that.
Correct. The last time I was negotiating with a self-described FDT agent I did it anyway. 🙂
I definitely would like to hear the details! (I mean, of that last particular case.)