Okay, I just spent a couple days on this and here’s my best take on Counterfactual Mugging with a logical coin. It mostly agrees with your conclusions, but makes them more precise in interesting ways. I’ll use good old proof search because it makes things clearer.
Let’s say the agent uses short proofs, the predictor uses medium proofs that are enough to fully simulate the agent (and that fact itself has a short proof), and figuring out the value of the coin requires a long proof (and that fact itself might require a long proof).
def World():
    if Coin() == True and Agent() == True:
        return -100
    elif Coin() == False and there's a medium proof that "Coin()=True implies Agent()=True":
        return 10000
    else:
        return 0

def Agent():
    iterate through all short proofs of the form
        "if Agent()=X, then Coin()=True implies World()=Y and Coin()=False implies World()=Z"
    return the X corresponding to the highest Y+Z found
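To make the decision rule concrete, here's a minimal runnable sketch of the Agent() loop with all the proof-theoretic machinery stubbed out. The names find_proofs and SHORT are placeholders I'm inventing for illustration, not part of the construction; the single "proof" the stub yields is just the counterfactual pair that the argument below says a short-proof agent is guaranteed to find.

from typing import Iterable, Tuple

Proof = Tuple[bool, int, int]  # (X, Y, Z) read as: "if Agent()=X, then
                               # Coin()=True implies World()=Y and
                               # Coin()=False implies World()=Z"
SHORT = 10**6  # placeholder bound on proof length

def find_proofs(max_length: int) -> Iterable[Proof]:
    # Stub for short-proof search: in this toy version we only hand back
    # the pair provable for X=True, as argued below.
    yield (True, -100, 10000)

def Agent() -> bool:
    # Pick the action X with the highest Y+Z among all found proofs.
    best_x, best_total = None, float("-inf")
    for x, y, z in find_proofs(SHORT):
        if y + z > best_total:
            best_x, best_total = x, y + z
    return best_x

Running this toy version just returns True with Y+Z = 9900, matching the argument below.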
Obviously there’s a short proof that returning True corresponds to Y+Z = -100 + 10000 = 9900, because medium proofs are enough to fully simulate the agent and that fact itself has a short proof. Can the agent find some better proof corresponding to False, and return False as a result? That’s only possible if there’s a medium proof that “Coin()=True implies Agent()=True”. But if the agent returns False, there’s a medium proof that “Agent()=False”. The two can be combined into a medium proof that “Coin()=False”, which is impossible by assumption. So the agent returns True, QED.
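For concreteness, writing \(\vdash_m\) for “there is a medium proof of” (notation I’m introducing here, not part of the setup above), the False branch boils down to:

\[
\vdash_m \big(\mathrm{Coin}() = \mathrm{True} \rightarrow \mathrm{Agent}() = \mathrm{True}\big)
\;\text{ and }\;
\vdash_m \mathrm{Agent}() = \mathrm{False}
\;\;\Longrightarrow\;\;
\vdash_m \mathrm{Coin}() = \mathrm{False},
\]

assuming, as in the prose, that chaining the two proofs (modus tollens plus the bookkeeping that Agent() and Coin() each return a single boolean) still counts as medium; the conclusion contradicts the assumption that the coin’s value only has long proofs.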
Seems like a victory for updatelessness, right? But my formalization has a huge problem: only we know the coin is hard; the agent doesn’t. It can imagine a world where “Coin()=False” has a medium proof, in which case “Coin()=True implies Agent()=True” has a medium proof for every possible agent (a provably false antecedent makes the implication provable regardless of what the agent does), so the predictor rewards all agents. Ruling out that possibility is at least as hard as proving bounded consistency up to medium proofs, which is out of reach for any agent relying on short proofs only. Basically, the agent can’t build the full picture of counterfactuals. It successfully returns True, but only because it can’t figure out what happens in the False case.
That, to me, is the main problem with Counterfactual Mugging and logical coins in general. If an agent is too weak to prove the coin’s value, it’s certainly too weak to prove that the coin is hard. The agent can’t rule out the possibility that the answer is just around the corner, so it can’t treat the coin as “distant” from counterfactuals about its own reasoning.
Switching from proof search to logical induction probably won’t help. The essential difficulty will remain: using easy reasoning to build the full counterfactual picture requires knowing that the coin won’t accidentally fall to easy reasoning, and you cannot know that even if it’s true.
Interesting! I’m not seeing why this phenomenon would remain when switching to LI, though. Can’t you guess, based on cryptographic assumptions or simple experience, that a particular logical coin is hard? And if the situation is set up so that you can’t guess this, aren’t you right to think it might be easy and to reason accordingly (giving some probability to the predictor figuring out the coin)?
Maybe you’re right and logical induction can quickly become confident that the coin is hard. In that case, encoding “updatelessness” seems easy. The agent should spend a small fixed amount of time choosing a successor program with high expected utility according to LI, and then run that program. That seems to solve both betting on a logical coin (where it’s best for the successor to compute the coin) and counterfactual mugging (where it’s best for the successor to pay up). Though we don’t know what shape the successor will take in general, and you could say figuring that out is the task of decision theory...
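Here’s a rough sketch of that two-stage scheme, just to pin down what “spend a small fixed amount of time, then commit” could mean. The LogicalInductor class, the expected_utility query, and the t_choose parameter are all placeholders I’m making up for illustration, not anything from the logical induction paper.

from typing import Callable, Sequence

Policy = Callable[[], bool]

class LogicalInductor:
    def expected_utility(self, policy: Policy, t: int) -> float:
        # Stage-t expectation of World() utility conditional on the agent
        # running `policy`.  Stubbed out; a real version would query the
        # inductor's prices on the relevant logical sentences.
        raise NotImplementedError

def updateless_agent(li: LogicalInductor,
                     candidates: Sequence[Policy],
                     t_choose: int) -> bool:
    # Stage 1: spend a small fixed budget (t_choose inductor steps) scoring
    # candidate successor programs by LI expected utility.
    successor = max(candidates, key=lambda p: li.expected_utility(p, t_choose))
    # Stage 2: commit to the chosen successor and run it for as long as it needs.
    return successor()

In the logical-coin bet the argmax should pick a successor that goes on to compute the coin, and in counterfactual mugging one that pays up; the trade-off discussed next is about how large t_choose can get before the choice stops being updateless.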
Another problem with this version of “updatelessness” is that if you only run the logical inductor a short amount of time before selecting the policy, the chosen policy could be terrible, since the early belief state of the inductor could be quite bad. It seems as if there’s at least something to be said about what it means to make a “good” trade-off between running too long before choosing the policy (so not being updateless enough) and running too short (so not knowing enough to choose policies wisely).