But if UDT starts with a broad prior, it will probably not learn, because it will have some weird stuff in its prior which causes it to obey random imperatives from imaginary Gods.
Are you suggesting that this is a unique problem for UDT, or that it affects UDT more than other decision theories? It seems like Bayesian decision theories can have the same problem; for example, a Bayesian agent might have a high prior that an otherwise non-interventionist God will reward them after death for not eating apples, and therefore not eat apples throughout their life. How is this different in principle from UDT refraining from paying the counterfactual mugger in your scenario to get a reward from God in the other branch? Why wouldn’t this problem be solved automatically given “good” or “reasonable” priors (whatever that means), which presumably would assign such gods low probabilities to begin with?
Interlocutor: The prior is subjective. An agent has no choice but to trust its own prior. From its own perspective, its prior is the most accurate description of reality it can articulate.
I wouldn’t say this, because I’m not sure that the prior is subjective. From my current perspective, I would say that it is part of the overall project of philosophy to figure out the nature of our priors and what their contents should be (if they’re not fully subjective, or if they have some degree of normativity).
So I think there are definitely problems in this area, but I’m not sure it has much to do with “learning” as opposed to “philosophy”, and the examples/thought experiments you give don’t seem to pump my intuition in that direction much. (How UDT works in iterated counterfactual mugging also seems fine to me.)
for example, a Bayesian agent might have a high prior that an otherwise non-interventionist God will reward them after death for not eating apples, and therefore not eat apples throughout their life.
Yeah, this is an important point, but I think UDT has it significantly worse. For one thing, UDT has the problem I mention on top of the problem you mention. But more importantly, I think the problem I mention is less tractable than the problem you mention.
EDIT: I’ve edited the essay to name my problem as “lizard worlds” (lizards reward updateful policies). So I’ll call the issue you raise the heaven/hell problem, and the issue I raise the lizard world problem.
For updateful DT, we can at least say: yes, a broad prior will include heaven/hell hypotheses which dramatically impact policy choice. But updateful agents have tools to address this problem:
- The prior includes a likelihood function for heaven/hell hypotheses, which specifies how the probability of such hypotheses gets adjusted in light of evidence.
- We mostly[1] trust simplicity priors to either make sensible likelihood functions, which will only lean towards heaven/hell hypotheses when there’s good reason, or else penalize heaven/hell hypotheses a priori for having a higher description complexity.
- We can also directly provide feedback about the value estimates to teach an updateful DT to have sensible expectations.[2]
None of these methods help UDT address the lizard world problem:
- The likelihood functions don’t matter; only the prior probability matters.
- Simplicity priors aren’t especially going to rule out these alternative worlds.[3]
- Direct feedback we give about expected values doesn’t reduce the prior weight of these problematic hypotheses.
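To make the contrast concrete, here is a minimal toy sketch (my own illustration, not something from the post: the hypothesis names, probabilities, and payoffs are all made up, and it assumes for simplicity that the heaven/hell hypothesis mildly mispredicts everyday observations, so the likelihood function has something to push against):

```python
# Toy contrast (illustrative only): an updateful agent vs. a UDT-style
# evaluation of an exotic "heaven/hell" hypothesis. All numbers are made up.

prior = {"mundane": 0.99, "heaven_hell": 0.01}

# Likelihood of one day's ordinary, god-free observations under each hypothesis
# (assumes the exotic hypothesis mildly mispredicts everyday evidence).
likelihood = {"mundane": 0.9, "heaven_hell": 0.3}

# Updateful: condition on 20 days of evidence, then act on the posterior.
posterior = dict(prior)
for _ in range(20):
    unnormalized = {h: posterior[h] * likelihood[h] for h in posterior}
    total = sum(unnormalized.values())
    posterior = {h: p / total for h, p in unnormalized.items()}
print(posterior["heaven_hell"])  # ~3e-12: the hypothesis stops driving decisions

# UDT-style: score whole policies against the *prior*; observations never enter.
payoff = {
    ("eat_apples", "mundane"): 1.0,
    ("eat_apples", "heaven_hell"): -1000.0,  # punished after death
    ("abstain", "mundane"): 0.9,
    ("abstain", "heaven_hell"): 1000.0,      # rewarded after death
}

def udt_value(policy):
    return sum(prior[h] * payoff[(policy, h)] for h in prior)

print({p: udt_value(p) for p in ("eat_apples", "abstain")})
# {'eat_apples': -9.01, 'abstain': 10.891}: the 1% exotic hypothesis dominates
# the policy comparison, and no amount of evidence changes that here.
```

The point is just that the updateful posterior can wash the exotic hypothesis out, while the updateless evaluation is stuck with whatever weight the prior happened to assign it.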
So I think there are definitely problems in this area, but I’m not sure it has much to do with “learning” as opposed to “philosophy” and the examples / thought experiments you give don’t seem to pump my intuition in that direction much. (How UDT works in iterated counterfactual mugging also seems fine to me.)
Yeah, I expect the thought experiment I start with is only going to be compelling to people who sort of already agree with me.
I do agree that “philosophy” problems are very close to this stuff, and it would be good to articulate in those terms.
[1] Modulo inner-optimizer concerns like simulation attacks.
[2] I’m imagining something like Bayesian RL, or Bayesian approval-directed agents, but perhaps with the twist that feedback is only sometimes given.
[3] This is somewhat nuanced/questionable. The lizards might be providing rewards/punishments for behavior in lots of worlds, not just Earth, so that this hypothesis doesn’t have to point to Earth specifically. However, if utilities are bounded, then this arguably weakens the rewards/punishments relevant to Earth, which is about as reassuring as giving the hypothesis less prior weight.
Perhaps it doesn’t have to weaken the rewards/punishments relevant to Earth, though, if lizards reward only those who always reject counterfactual muggings in all other worlds (not including the Lizard’s offer, of course, which is arguably a counterfactual mugging itself).
Also, I think there are more complications related to inner optimizers.
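A back-of-the-envelope version of the bounded-utility point in footnote [3] (my own numbers, purely illustrative): if utility is bounded and the lizards spread their rewards/punishments over many worlds rather than pointing at Earth specifically, the stake riding on Earth behavior shrinks roughly in proportion, which affects the policy comparison about as much as shrinking the hypothesis’s prior weight would.

```python
# Illustrative arithmetic for footnote [3]: bounded utility dilutes per-world stakes.
prior_lizard = 0.01   # made-up prior weight on the lizard hypothesis
utility_range = 1.0   # total utility swing available when utilities are bounded
n_worlds = 10**6      # worlds the lizards' rewards/punishments are spread across

earth_stake = prior_lizard * utility_range / n_worlds
print(earth_stake)    # 1e-08, comparable to a hypothesis that targets Earth alone
                      # but carries only ~1e-8 prior weight
```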
FWIW, in our original formulation of open-minded updatelessness, the idea was about revising the prior via either
- becoming aware of new possibilities (thereby changing the support of the prior);
- “philosophy”, i.e., reflecting on principles for prior-setting. (Which would allow for the use of an “objective” prior, if we ended up thinking there was such a thing.)
(Cool post, Abraham, thanks!)