In the ADT paper, the asymptotic dominance argument is about the limit of the agent’s action as epsilon goes to 0. This limit is not necessarily computable, so the embedder can’t contain the agent, since the embedder doesn’t know epsilon. So the evil problem doesn’t work.
Agreed that the evil problem doesn’t work for the original ADT paper. In the original ADT paper, the agents are allowed to output distributions over moves. I didn’t like this because it implicitly assumes that it’s possible for the agent to perfectly randomize, and I think randomization is better modeled by a (deterministic) action that consults an environmental random-number generator, which may be correlated with other things.
What I meant was that, in the version of argmax that I set up, if A is the two constant policies “take blank box” and “take shiny box”, then for the embedder F where the opponent runs argmax to select which box to fill, the argmax agent converges to something that behaves like randomizing between the two policies, despite being deterministic: the logical inductor assigns very similar expected utility to both options, so it can’t predict which action will be chosen. This happens because the inductor outputting more of “take the blank box” makes $F(\text{shiny})$ converge to a higher expected value (so argmax learns to copy that), and the inductor outputting more of “take the shiny box” makes $F(\text{blank})$ converge to a higher expected value (so argmax learns to copy that).
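To make that feedback loop concrete, here is a toy sketch (entirely my own: a bare frequency counter stands in for the logical inductor, and the payoff details are simplified). The opponent fills whichever box the agent is predicted not to take, the agent argmaxes against that same prediction, and neither pure prediction is stable, so the predictor gets pushed to roughly 50/50 with the two policies looking equally good on average.

```python
# Toy stand-in for the dynamic above -- NOT the logical-inductor construction.
from collections import Counter

counts = Counter({"blank": 1, "shiny": 1})  # Laplace-style pseudocounts for the prediction
history = []

for _ in range(10_000):
    p_blank = counts["blank"] / (counts["blank"] + counts["shiny"])
    # Opponent: fill the box the agent is predicted NOT to take.
    filled = "shiny" if p_blank >= 0.5 else "blank"
    # Agent: estimated value of each constant policy under that prediction.
    value = {"blank": 1.0 if filled == "blank" else 0.0,
             "shiny": 1.0 if filled == "shiny" else 0.0}
    action = max(value, key=value.get)  # argmax over the two constant policies
    counts[action] += 1
    history.append(action)

print(sum(a == "blank" for a in history) / len(history))  # converges to ~0.5
```

Whenever the predictor leans toward one box, the other box looks strictly better and argmax flips, so the only resting point is maximal uncertainty about the agent’s own action.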
The optimality proof might be valid. I didn’t understand which specific step you thought was wrong.
So, the original statement in the paper was:
It must then be the case that $\lim_{t\to\infty}\mathbb{E}_t[F_t(A_t)-F_t(B_t)]>\eta$ for every $A\in[A]$, $B\notin[A]$. Let $A$ be the first element of $[A]$ in $\mathcal{A}$. Since every class will be separated by at least $\eta$ in the limit, $\mathrm{sadt}_\eta(F,\mathcal{A})$ will eventually be a distribution over just $[A]$. And since $A\sim A'$ for every $A,A'\in[A]$, by the definition of soft_argmax it must be the case that $\lim_{t\to\infty}\big[\,|\mathrm{sadt}_\eta(F,\mathcal{A})_t-A_t|\,\big]=0$.
The issue with this is the last sentence. It’s basically saying “since the two actions $A$ and $A'$ get equal expected utility in the limit, the total variation distance between a distribution over the two actions, and one of the actions, limits to zero”, which is false.
And it is specifically disproved by the second counterexample, where there are two actions that both result in 1 utility, so they’re both in the same equivalence class, but a probabilistic mixture between them (which $\mathrm{sadt}_\eta$ converges to playing, for all $\eta$) gets less than 1 utility.
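As a toy environment with the same shape (my own illustration, not necessarily the counterexample from the post): suppose the opponent predicts the agent’s move and the agent gets utility 1 exactly when it plays the predicted move. Then

$$\mathbb{E}[U(\text{always blank})]=1,\qquad \mathbb{E}[U(\text{always shiny})]=1,\qquad \mathbb{E}\big[U\big(\tfrac12\,\text{blank}+\tfrac12\,\text{shiny}\big)\big]=\tfrac12,$$

so the two pure policies fall into one equivalence class while the mixture over them is strictly worse, which is exactly the gap the quoted sentence ignores.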
Consider the following embedder. According to this embedder, you will play chicken against ADT-epsilon who knows who you are. When ADT-epsilon considers this embedder, it will always pass the reality filter, since in fact ADT-epsilon is playing against ADT-epsilon. Furthermore, this embedder gives NeverSwerveBot a high utility. So ADT-epsilon expects a high utility from this embedder, through NeverSwerveBot, and it never swerves.
You’ll have to be more specific about “who knows who you are”. If it unpacks as “opponent only uses the embedder where it is up against [whatever policy you plugged in]”, then NeverSwerveBot will have a high utility, but the embedder will get knocked down by the reality filter: if you converge to never swerving, $\mathbb{E}_t(U_t)$ will converge to 0, while the inductor will learn that $\text{straight}=\text{argmax}_{F,t}=\text{ADT}_t$, so it will converge to assigning equal expected value to $F(\text{straight})$ and $F(\text{ADT})$, and $\mathbb{E}(F(\text{straight}))$ converges to 1.
If it unpacks as “opponent is ADT-epsilon”, and you converge to never swerving, then argmaxing will start duplicating the swerve strategy instead of going straight. In both cases, the argument fails.
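For the first case, here is a rough schematic of the reality-filter step being invoked (the function name, the numbers, and the tolerance are all mine; the actual construction compares the logical inductor’s expectations rather than hand-set floats):

```python
# Schematic of the reality filter: an embedder survives only if the inductor's
# expectation of the embedder applied to the actual agent tracks its
# expectation of the utility actually received.
def passes_reality_filter(expected_real_utility, expected_embedded_utility, tol=0.05):
    return abs(expected_real_utility - expected_embedded_utility) <= tol

# Limiting values claimed in the argument above, with utilities scaled so that
# crash/crash = 0 and going straight against a swerver = 1:
E_U     = 0.0  # E_t(U_t): ADT vs ADT both converge to going straight
E_F_ADT = 1.0  # E_t(F(ADT)): inherits F(straight)'s value, per the argument
print(passes_reality_filter(E_U, E_F_ADT))  # False -> this embedder is discarded
```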
In the original ADT paper, the agents are allowed to output distributions over moves.
The fact that we take the limit as epsilon goes to 0 means the evil problem can’t be constructed, even if randomization is not allowed. (The proof in the ADT paper doesn’t work, but that doesn’t mean something like it couldn’t possibly work.)
It’s basically saying “since the two actions $A$ and $A'$ get equal expected utility in the limit, the total variation distance between a distribution over the two actions, and one of the actions, limits to zero”, which is false.
You’re right, this is an error in the proof, good catch.
Re chicken: The interpretation of the embedder that I meant is “opponent only uses the embedder where it is up against [whatever policy you plugged in]”. This embedder does not get knocked down by the reality filter. Let $E_t$ be the embedder. The logical inductor expects $U_t$ to equal the crash/crash utility, and it also expects $E_t(\ulcorner\text{ADT}_\epsilon\urcorner)$ to equal the crash/crash utility. The expressions $U_t$ and $E_t(\ulcorner\text{ADT}_\epsilon\urcorner)$ are provably equal, so of course the logical inductor expects them to be the same, and the reality check passes.
The error in your argument is that you are embedding actions rather than agents. The fact that NeverSwerveBot and ADT both provably always take the straight action does not mean the embedder assigns them equal utilities.
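A minimal sketch of that distinction, with a made-up embedder keyed on source code (nothing here is the construction from either writeup): two programs that provably output the same move can still be handed different utilities, because the opponent inside the embedder reasons about the code it is told it faces, not about the move.

```python
# Toy embedder over agent *source code* (entirely hypothetical): against code
# it recognizes as ADT, the opponent also goes straight (crash); against any
# other straight-going code, it swerves. Extensionally identical agents,
# different utilities.
def toy_embedder(agent_source: str) -> float:
    if "ADT" in agent_source:
        return 0.0  # crash/crash: the opponent also refuses to swerve
    return 1.0      # the opponent swerves, the plugged-in agent goes straight

print(toy_embedder("def NeverSwerveBot(): return 'straight'"))  # 1.0
print(toy_embedder("def ADT_epsilon(): return 'straight'"))     # 0.0
```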
Wasn’t there a fairness/continuity condition in the original ADT paper saying that if two “agents” converged to always taking the same action, then the embedder would assign them the same value? (More specifically: if $\mathbb{E}_t(|A_t-B_t|)<\delta$, then $\mathbb{E}_t(|E_t(A_t)-E_t(B_t)|)<\epsilon$.) This would mean that it’d be impossible to have $\mathbb{E}_t(E_t(\text{ADT}_{t,\epsilon}))$ be low while $\mathbb{E}_t(E_t(\text{straight}_t))$ is high, so the argument still goes through.
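Spelling out that step (my paraphrase, writing $\mathbb{E}_t$ for the inductor’s expectation and $E_t$ for the embedder, and taking the condition in its limiting form): if ADT converges to going straight, then

$$\mathbb{E}_t(|\text{ADT}_{t,\epsilon}-\text{straight}_t|)\to 0 \;\Longrightarrow\; \mathbb{E}_t\big(|E_t(\text{ADT}_{t,\epsilon})-E_t(\text{straight}_t)|\big)\to 0,$$

so $\mathbb{E}_t(E_t(\text{ADT}_{t,\epsilon}))$ gets pulled up toward $\mathbb{E}_t(E_t(\text{straight}_t))\to 1$ while $\mathbb{E}_t(U_t)\to 0$, and that gap is exactly what the reality filter from the earlier comment punishes.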
Although, after this whole line of discussion, I’m realizing that there are enough substantial differences between the original formulation of ADT and the thing I wrote up that I should probably clean up this post a bit and clarify more about what’s different in the two formulations. Thanks for that.
Yes, the continuity condition on embedders in the ADT paper would eliminate the embedder I meant. Which means the answer might depend on whether ADT considers discontinuous embedders. (The importance of the continuity condition is that it is used in the optimality proof; the optimality proof can’t apply to chicken for this reason).