So we seem to face a fundamental trade-off between the information benefits of learning (updating) and the strategic benefits of updatelessness. If I learn the digit, I will better navigate some situations which require this information, but I will lose the strategic power of coordinating with my counterfactual self, which is necessary in other situations.
It seems like we should be able to design software systems that are immune to any infohazard, including logical infohazards.
If it’s helpful to act on a piece of information you know, act on it.
If it’s not helpful to act on a piece of information you know, act as if you didn’t know it.
Ideally, we could just prove that “Decision Theory X never calculates a negative value of information”. But if needed, we could explicitly design a cognitive architecture with infohazard mitigation in mind. Some options include:
An “ignore this information in this situation” flag
Upon noticing “this information would be detrimental to act on in this situation”, we could decide to act as if we didn’t know it, in that situation.
(I think this is one of the designs you mentioned in footnote 4.)
Cognitive sandboxes
Spin up some software in a sandbox to do your thinking for you.
The software should only return logical information that is true and useful in your current situation.
If it notices any hazardous information, it simply doesn’t return it to you.
Upon noticing that a train of thought doesn’t lead to any true and useful information, don’t think about why that is and move on.
I agree with your point in footnote 4, that the hard part is knowing when to ignore information. Upon noticing that it would be helpful to ignore something, the actual ignoring seems easy.
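To make the two designs above concrete, here is a minimal sketch, with the caveat that the toy `worlds` model and every name in it (`Policy`, `ex_ante_value`, and so on) are my own illustrative assumptions rather than anything from the post: the ignore-flag wraps the agent's policy, and the cognitive sandbox applies the same test one step earlier, before the information ever reaches the agent.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = str
Obs = str
Policy = Callable[[Optional[Obs]], Action]      # maps an observation (or None) to an action
World = Tuple[float, Obs, Dict[Action, float]]  # (prior prob, observation produced, payoffs)


def ex_ante_value(policy: Policy, worlds: List[World], reveal: bool) -> float:
    """Prior expected utility of the policy, with the observation either shown or hidden."""
    return sum(p * payoffs[policy(obs if reveal else None)] for p, obs, payoffs in worlds)


def with_ignore_flag(base: Policy, worlds: List[World]) -> Policy:
    """Design 1: let actions depend on the observation only if that is endorsed ex ante."""
    if ex_ante_value(base, worlds, reveal=True) >= ex_ante_value(base, worlds, reveal=False):
        return base
    return lambda _obs: base(None)  # otherwise act as if the observation were never seen


def sandbox_filter(base: Policy, worlds: List[World], obs: Obs) -> Optional[Obs]:
    """Design 2: the sandbox hands over the observation only if acting on it isn't hazardous."""
    helpful = ex_ante_value(base, worlds, reveal=True) >= ex_ante_value(base, worlds, reveal=False)
    return obs if helpful else None  # withheld, with no explanation of why
```

The actual ignoring is just the fallback branch at the end; all the work is in the comparison that decides when to take it, which is the "knowing when" part.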
It seems like we should be able to design software systems that are immune to any infohazard
As mentioned in another comment, I think this is not possible to solve in full generality (meaning, for all priors), because that requires complete updatelessness and we don’t want to do that.
I think all your proposed approaches are equivalent (and I think the most intuitive framing is “cognitive sandboxes”). And I think they don’t work, because of reasoning close to this paragraph:
Unfortunately, it’s not that easy, and the problem recurs at a higher level: your procedure to decide which information to use will depend on all the information, and so you will already lose strategicness. Or, if it doesn’t depend, then you are just being updateless, not using the information in any way.
But again, the problem might be solvable in particular cases (like, our prior).
The distinction between “solving the problem for our prior” and “solving the problem for all priors” definitely helps! Thank you!
I want to make sure I understand the way you’re using the term updateless, in cases where the optimal policy involves correlating actions with observations. Like pushing a red button upon seeing a red light, but pushing a blue button upon seeing a blue light. It seems like (See Red → Push Red, See Blue → Push Blue) is the policy that CDT, EDT, and UDT would all implement.
In the way that I understand the terms, CDT and EDT are updateful procedures, and UDT is updateless. And all three are able to use information available to them. It’s just that an updateless decision procedure always handles information in ways that are endorsed a priori. (True information can degrade the performance of updateful decision theories, but updatelessness implies infohazard immunity.)
Is this consistent with the way you’re describing decision-making procedures as updateful and updateless?
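For concreteness, here is the toy version of the red/blue case I have in mind (the payoffs and names are my own, just to pin down the example):

```python
from itertools import product

observations = ["red", "blue"]
actions = ["push_red", "push_blue"]
prob = {"red": 0.5, "blue": 0.5}

def payoff(obs, act):
    # Pushing the button that matches the light pays 1, anything else pays 0.
    return 1.0 if act == f"push_{obs}" else 0.0

# UDT-style: score every full policy (a map from observation to action) under the prior.
policies = [dict(zip(observations, acts)) for acts in product(actions, repeat=len(observations))]
best_policy = max(policies, key=lambda pi: sum(prob[o] * payoff(o, pi[o]) for o in observations))

# CDT/EDT-style: after seeing the light, pick the best response to that observation.
updateful = {o: max(actions, key=lambda a: payoff(o, a)) for o in observations}

assert best_policy == updateful == {"red": "push_red", "blue": "push_blue"}
```

Here the policy chosen ex ante and the one chosen after updating coincide, which is the kind of case I have in mind.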
It also seems like if an agent is regarding some information as hazardous, that agent isn’t being properly updateless with respect to that information. In particular, if it finds that it’s afraid to learn true information about other agents (such as their inclinations and pre-commitments), it already knows that it will mishandle that information upon learning it. And if it were properly updateless, it would handle that information properly.
It seems like we can use that “flinching away from true information” as a signal that we’d like to change the way our future self will handle learning that information. If our software systems ever notice themselves calculating a negative value of information for an observation (empirical or logical), the details of that calculation will reveal at least one counterfactual branch where they’re mishandling that information. It seems like we should always be able to automatically patch that part of our policy, possibly using a commitment that binds our future self.
In the worst case, we should always be able to do what our ignorant self would have done, so information should never hurt us.
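Here is a sketch of that worst-case fallback, where the worlds, payoffs, and names are all my own toy assumptions: clamp any branch where the informed choice does worse (by our own prior) than the ignorant one, and the patched policy can't do worse ex ante than the ignorant policy.

```python
# Start from a possibly flawed observation-dependent policy, and on every branch
# where its choice is worse than the ignorant action, commit to the ignorant action.

# Worlds: (prior probability, observation produced, payoff of each action).
worlds = [
    (0.5, "A", {"act_x": 1.0, "act_y": 0.0}),
    (0.5, "B", {"act_x": 0.0, "act_y": -2.0}),  # acting on "B" with act_y backfires
]

ignorant_action = "act_x"                    # what we would do with no observation
informed = {"A": "act_x", "B": "act_y"}      # a flawed observation-dependent policy


def eu(policy):
    """Prior expected utility of an observation -> action mapping."""
    return sum(p * payoffs[policy[obs]] for p, obs, payoffs in worlds)


def conditional_eu(obs, action):
    """Expected payoff of `action` over the worlds consistent with `obs`."""
    consistent = [(p, payoffs) for p, o, payoffs in worlds if o == obs]
    total = sum(p for p, _ in consistent)
    return sum(p * payoffs[action] for p, payoffs in consistent) / total


# The patch: wherever the informed choice underperforms the ignorant one on its
# own branch, fall back to the ignorant action (a commitment binding future-us).
patched = {
    obs: act if conditional_eu(obs, act) >= conditional_eu(obs, ignorant_action)
    else ignorant_action
    for obs, act in informed.items()
}

assert eu(patched) >= eu({obs: ignorant_action for _, obs, _ in worlds})
print(patched)  # {'A': 'act_x', 'B': 'act_x'}
```

The per-branch comparison is what makes the guarantee go through: each branch is at least as good as the ignorant action, so the prior-weighted mixture is too.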
Is this consistent with the way you’re describing decision-making procedures as updateful and updateless?
Absolutely. A good implementation of UDT can, from its prior, decide on an updateful strategy. It’s just that it won’t be able to change its mind about which updateful strategy seems best. See this comment for more.
“flinching away from true information”
As mentioned also in that comment, correct implementations of UDT don’t actually flinch away from information: they just decide ex ante (before they have access to that information) whether or not they will let their future actions depend on it.
The problem remains though: you make the ex ante call about which information to “decision-relevantly update on”, and this can be a wrong call, and this creates commitment races, etc.
My understanding is that commitment races only occur in cases where “information about the commitments made by other agents” has negative value for all relevant agents. (All agents are racing to commit before learning more, which might scare them away from making such a commitment.)
It seems like updateless agents should not find themselves in commitment races.
My impression is that we don’t have a satisfactory extension of UDT to multi-agent interactions. But I suspect that the updateless response to observing “your counterpart has committed to going Straight” will look less like “Swerve, since that’s the best response” and more like “go Straight with enough probability that your counterpart wishes they’d coordinated with you rather than trying to bully you.”
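As a toy version of what I mean (the Chicken payoffs below are my own assumption, just to make the threshold concrete):

```python
# Toy numbers for the Chicken game above.  From the counterpart's point of view:
#   R = payoff if we both coordinate (both Swerve)
#   T = payoff if it bullies successfully (it goes Straight, we Swerve)
#   C = payoff if we both go Straight (crash)
R, T, C = 2.0, 3.0, -10.0

# If we respond to its commitment by going Straight with probability p, its
# expected payoff from having committed is (1 - p) * T + p * C.  Pick p so that
# this falls below R, i.e. it wishes it had coordinated instead.
p_min = (T - R) / (T - C)
print(f"go Straight with probability just above {p_min:.3f}")  # about 0.077 with these numbers

p = p_min + 0.01
assert (1 - p) * T + p * C < R  # committing now looks worse than coordinating would have
```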
Offering to coordinate on socially optimal outcomes, and being willing to pay costs to discourage bullying, seems like a generalizable way for smart agents to achieve good outcomes.