for example a Bayesian agent might have a high prior that an otherwise non-interventionist God will reward them after death for not eating apples, and therefore not eat apples throughout their life.
Yeah, this is an important point, but I think UDT has it significantly worse. For one thing, UDT has the problem I mention on top of the problem you mention. But more importantly, I think the problem I mention is less tractable than the problem you mention.
EDIT: I’ve edited the essay to name my problem as “lizard worlds” (lizards reward updateful policies). So I’ll call the issue you raise the heaven/hell problem, and the issue I raise the lizard world problem.
For updateful DT, we can at least say: yes, a broad prior will include heaven/hell hypotheses which dramatically impact policy choice. But updateful priors have tools to address this problem:
The prior includes a likelihood function for heaven/hell hypotheses, which specifies how the probability of such hypotheses gets adjusted in light of evidence (see the toy sketch after this list).
We mostly[1] trust simplicity priors to either make sensible likelihood functions, which will only lean towards heaven/hell hypotheses when there’s good reason, or else penalize heaven/hell hypotheses a priori for having a higher description complexity.
We can also directly provide feedback about the value estimates to teach an updateful DT to have sensible expectations.[2]
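To make the first of these tools concrete, here is a minimal toy sketch of the updateful case. The hypothesis names, prior weights, and likelihoods are all made-up illustrative numbers, not anything from the essay; the point is only the structure of the calculation.

```python
# Toy updateful (Bayesian) agent: the likelihood function lets ordinary
# evidence push down the posterior weight of a heaven/hell hypothesis.
# All numbers here are illustrative assumptions.

prior = {"mundane": 0.99, "heaven_hell": 0.01}

# Probability of the observed evidence (an ordinary, miracle-free life so far)
# under each hypothesis.
likelihood = {"mundane": 0.9, "heaven_hell": 0.1}

# Bayes' rule: posterior is proportional to prior * likelihood.
normalizer = sum(prior[h] * likelihood[h] for h in prior)
posterior = {h: prior[h] * likelihood[h] / normalizer for h in prior}

print(posterior)  # the heaven/hell hypothesis drops from 0.01 to ~0.001
```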
None of these methods help UDT address the lizard world problem:
The likelihood functions don’t matter; only the prior probability matters (see the sketch after this list).
Simplicity priors aren’t especially going to rule out these alternative worlds.[3]
Direct feedback we give about expected values doesn’t reduce the prior weight of these problematic hypotheses.
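By contrast, here is a minimal sketch of the UDT-style calculation under the same kind of made-up numbers: the policy is chosen to maximize expected utility under the prior, so the likelihood function never enters, and a small prior weight on a lizard world promising a large payoff can dominate the choice.

```python
# Toy UDT-style agent: choose a whole policy by prior expected utility,
# with no updating on evidence. Numbers and payoffs are illustrative assumptions.

prior = {"mundane": 0.99, "lizard": 0.01}

# Payoff of each candidate policy under each hypothesis. The lizard world
# attaches a large reward to behaving updatefully.
payoff = {
    "mundane": {"updateful": 1.0, "updateless": 1.0},
    "lizard":  {"updateful": 1000.0, "updateless": 0.0},
}

def prior_expected_utility(policy):
    return sum(prior[h] * payoff[h][policy] for h in prior)

scores = {p: prior_expected_utility(p) for p in ("updateful", "updateless")}
print(scores)                       # {'updateful': 10.99, 'updateless': 0.99}
print(max(scores, key=scores.get))  # the 1% lizard hypothesis decides the policy
```

Nothing in this second calculation depends on how likely the evidence is under the lizard hypothesis, which is the asymmetry the list above is pointing at.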
So I think there are definitely problems in this area, but I’m not sure it has much to do with “learning” as opposed to “philosophy” and the examples / thought experiments you give don’t seem to pump my intuition in that direction much. (How UDT works in iterated counterfactual mugging also seems fine to me.)
Yeah, I expect the thought experiment I start with is only going to be compelling to people who sort of already agree with me.
I do agree that “philosophy” problems are very close to this stuff, and it would be good to articulate in those terms.
[1] Modulo inner-optimizer concerns like simulation attacks.
[2] I’m imagining something like Bayesian RL, or Bayesian approval-directed agents, but perhaps with the twist that feedback is only sometimes given.
[3] This is somewhat nuanced/questionable. The lizards might be providing rewards/punishments for behavior in lots of worlds, not just Earth, so that this hypothesis doesn’t have to point to Earth specifically. However, if utilities are bounded, then this arguably weakens the rewards/punishments relevant to Earth, which is about as reassuring as giving this hypothesis less prior weight.
Perhaps it doesn’t have to weaken the rewards/punishments relevant to Earth, though, if lizards reward only those who always reject counterfactual muggings in all other worlds (not including the lizard’s offer, of course, which is arguably a counterfactual mugging itself).
Also, I think there are more complications related to inner optimizers.