Imagine if we had narrowed down the human prior to two possibilities, P_1 and P_2 . Humans can’t figure out which one represents our beliefs better, but the superintelligent AI will be able to figure it out. Moreover, suppose that P_2 is bad enough that it will lead to a catastrophe from the human perspective (that is, from the P_1 perspective), even if the AI were using UDT with 50-50 uncertainty between the two. Clearly, we want the AI to be updateful about which of the two hypotheses is correct.
This seems like the central argument in the post, but I don’t understand how it works.
Here’s a toy example. There are two envelopes: one contains $100, the other leads to a loss of $10000. We don’t know which envelope is which, but it’s possible to figure it out by running a long computation. So we make a money-maximizing UDT AI whose prior is “the $100 is in whichever envelope {long_computation} says”. Now if the AI has time to do the long computation, it’ll do it and then open the right envelope. And if it doesn’t have time to do the long computation, and is offered the choice between opening a random envelope and abstaining, it will abstain. So it seems like ordinary UDT solves this example just fine. Can you explain where “updatefulness” comes in?
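For concreteness, here is a minimal sketch of that setup (the policy encoding and payoff bookkeeping are my own illustration, not anything from the thread):

```python
# Minimal sketch of the two-envelope example; the encoding of policies and
# observations is an assumption made for illustration.

def expected_value(policy, can_compute):
    # The agent is logically uncertain about what the long computation outputs,
    # so it averages over both possible answers with a 50-50 prior.
    total = 0.0
    for true_answer in ("left", "right"):      # possible outputs of the long computation
        observation = true_answer if can_compute else None
        choice = policy(observation)
        if choice is None:                     # abstain
            payoff = 0
        elif choice == true_answer:            # opened the $100 envelope
            payoff = 100
        else:                                  # opened the losing envelope
            payoff = -10_000
        total += 0.5 * payoff
    return total

follow_computation = lambda obs: obs           # open whichever envelope the computation says
abstain            = lambda obs: None          # decline to open either envelope
guess_left         = lambda obs: "left"        # open the left envelope blindly

print(expected_value(follow_computation, can_compute=True))   # 100.0
print(expected_value(guess_left,        can_compute=False))   # -4950.0
print(expected_value(abstain,           can_compute=False))   # 0.0
```

When the computation hasn’t been run, abstaining (EV 0) beats opening a random envelope (EV −4950), which is the behavior described above.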
Let’s frame it in terms of value learning.
Naive position: UDT can’t be combined with value learning, since UDT doesn’t learn. If we’re not sure whether puppies or rainbows are what we intrinsically value, but rainbows are easier to manufacture, then the superintelligent UDT will tile the universe with rainbows instead of puppies because that has higher expectation according to the prior, regardless of evidence it encounters suggesting that puppies are what’s more valuable.
Cousin_it’s reply: There’s puppy-world and rainbow-world. In rainbow-world, tiling the universe with rainbows has 100 utility, and tiling the universe with puppies has 0 utility. In puppy-world, tiling the universe with puppies has 90 utility (because puppies are harder to maximize than rainbows), but rainbows have 0 utility.
The UDT agent gets to observe which universe it is in, although it has a 50-50 prior on the two. There are four policies:
Always make puppies: this has a 50% chance of a utility of 90, and otherwise yields zero.
EV: 45
Always make rainbows: 50% chance of utility 100, otherwise zero.
EV: 50
Make puppies in rainbow world; make rainbows in puppy world: worth 0 under either hypothesis.
EV: 0
Make puppies in puppy world, make rainbows in rainbow world: 50% chance of 90, 50% chance of 100.
EV: 95
The highest EV is to do the obvious value-learning thing; so, there’s no problem.
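A quick way to double-check those numbers is to enumerate the four policies directly. A minimal sketch, with my own encoding of worlds, hypotheses, and policies:

```python
from itertools import product

WORLDS  = ["puppy", "rainbow"]                 # which hypothesis is true, 50-50 prior
ACTIONS = ["make_puppies", "make_rainbows"]

def utility(true_world, action_taken_there):
    # "Plays nice": each hypothesis only scores what happens in its own world.
    if true_world == "puppy":
        return 90 if action_taken_there == "make_puppies" else 0
    return 100 if action_taken_there == "make_rainbows" else 0

# A policy maps the observed world to an action; enumerate all four.
for acts in product(ACTIONS, repeat=2):
    policy = dict(zip(WORLDS, acts))
    ev = sum(0.5 * utility(w, policy[w]) for w in WORLDS)
    print(policy, "EV =", ev)
# Prints EVs 45, 95, 0, 50: the value-learning policy (EV 95) comes out on top.
```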
Fixing the naive position: Some hypotheses will “play nice” like the example above, and updateless value learning will work fine.
However, there are some versions of “valuing puppies” and “valuing rainbows” which value puppies/rainbows regardless of which universe the puppies/rainbows are in. This only requires that there’s some sort of embedding of counterfactual information into the sigma-algebra which the utility functions are predicated on. For example, if the agent has beliefs about PA, these utility functions could check for the number of puppies/rainbows in arbitrary computations. This mostly won’t matter, because the agent doesn’t have any control over arbitrary computations; but some of the computations contemplated in Rainbow Universe will be good models of Puppy Universe. Such a rainbow-value-hypothesis will value policies which create rainbows over puppies regardless of which branch they do it in.
These utility functions are called “nosy neighbors” because they care about what happens in other realities, not just their own.
Suppose the puppy hypothesis and the rainbow hypothesis are both nosy neighbors. I’ll assume they’re nosy enough that they value puppies/rainbows in other universes exactly as much as in their own. There are four policies:
Always make puppies: a 50% chance of being worthless (if the rainbow hypothesis is true), and a 50% chance of getting 90 for making puppies in puppy-universe plus 90 more for making puppies in rainbow-universe.
EV: 90
Always make rainbows: 50% worthless, 50% worth 100 + 100.
EV: 100
Make puppies in rainbow universe, rainbows in puppy universe: a 50% chance of a value of 90, a 50% chance of a value of 100.
EV: 95
Make puppies in puppy universe, rainbows in rainbow universe: likewise a 50% chance of 90 and a 50% chance of 100.
EV: 95
In the presence of nosy neighbors, the naive position is vindicated: “always make rainbows” has the highest EV (100, versus 95 for the value-learning policies), so the agent tiles the universe with rainbows no matter which universe it observes. UDT doesn’t do “value learning”.
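The same enumeration, with the utilities swapped for nosy-neighbor versions in which the true hypothesis counts puppies/rainbows in both universes equally (again, the encoding is my own illustration):

```python
from itertools import product

WORLDS  = ["puppy", "rainbow"]                 # which hypothesis is true, 50-50 prior
ACTIONS = ["make_puppies", "make_rainbows"]

def utility(true_world, policy):
    # Nosy neighbor: the true hypothesis scores the actions taken in *both*
    # universes, valuing its favored outcome there as much as at home.
    if true_world == "puppy":
        return sum(90 for w in WORLDS if policy[w] == "make_puppies")
    return sum(100 for w in WORLDS if policy[w] == "make_rainbows")

for acts in product(ACTIONS, repeat=2):
    policy = dict(zip(WORLDS, acts))
    ev = sum(0.5 * utility(w, policy) for w in WORLDS)
    print(policy, "EV =", ev)
# Prints EVs 90, 95, 95, 100: "always make rainbows" wins, independent of the observation.
```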
The argument is similar for the case of ‘learning the correct prior’. The problem is that if we start with a broad prior over possible priors, then there can be nosy-neighbor hypotheses which spoil the learning. These are hard to rule out, because it is hard to rule out simulations of other possible worlds.
Going back to the envelopes example, a nosy neighbor hypothesis would be “the left envelope contains $100, even in the world where the right envelope contains $100”. Or if we have an AI that’s unsure whether it values paperclips or staples, a nosy neighbor hypothesis would be “I value paperclips, even in the world where I value staples”. I’m not sure how that makes sense. Can you give some scenario where a nosy neighbor hypothesis makes sense?
I think so, yes, but I want to note that my position is consistent with nosy-neighbor hypotheses not making sense. A big part of my point is that there’s a lot of nonsense in a broad prior. I think it’s hard to rule out the nonsense without learning. If someone thought nosy neighbors always ‘make sense’, it could be an argument against my whole position. (Because that person might be just fine with UDT, thinking that my nosy-neighbor ‘problems’ are just counterfactual muggings.)
Here’s an argument that nosy neighbors can make sense.
For values, as I mentioned, a nosy-neighbor hypothesis is a value system which cares about what happens in many different universes, not just the ‘actual’ universe. For example, a utility function which assigns some value to statements of mathematics.
For probability, a nosy-neighbor is like the Lizard World hypothesis mentioned in the post: it’s a world where what happens there depends a lot on what happens in other worlds.
I think what you wrote about staples vs paperclips nosy neighbors is basically right, but maybe it ‘makes more sense’ if we rephrase it as: “I (actual me) value paperclips being produced in the counterfactual(-from-my-perspective) world where I (counterfactual me) don’t value paperclips.”
Anyway, whether or not it makes intuitive sense, it’s mathematically fine. The idea is that a world will contain facts that are a good lens into alternative worlds (such as facts of Peano Arithmetic), which utility hypotheses / probabilistic hypotheses can care about. So although a hypothesis is only mathematically defined as a function of worlds where it holds, it “sneakily” depends on stuff that goes on in other worlds as well.
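To make the ‘sneaky’ dependence concrete, here is one way to write down a nosy-neighbor rainbow hypothesis (my own notation, not from the post). Let emb_w(w') stand for some encoding of an alternative world w' among the facts of w, e.g. as a computation the agent has beliefs about, and define

U_rainbow(w) = rainbows(w) + rainbows(emb_w(puppy-world)).

Officially U_rainbow is a function of the single world w, but its value tracks what the policy does in the embedded copy of the puppy world.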