It sounds like you considered a more general setting than I am at the moment. I want to eventually move to that kind of “combined outcome” setting, but first, I want to understand more classical preference structures and break things one at a time.
Do you think your version sheds any light on value learning in UDT? I had a discussion with Alex Appel about this, in which it seemed like you have a “nosy neighbors” problem, where a potential set of values may care about what happens even in worlds where different values hold; but this problem seemed to be bounded by such other-world preferences acting like beliefs. For example, you could imagine a UDT agent with world-models in which either vegetarianism or carnivorism is right (and which somehow make different predictions). Each set of preferences can be either “nosy” (it cares what happens regardless of which facts end up true) or “non-nosy” (it cares only about what happens in its own world: vegetarianism cares about the amount of meat eaten in veg-world, and carnivorism cares about the amount of meat eaten in carn-world).
The claim which seemed plausible was that nosiness has some kind of balancing behavior which acts like probability: putting some of your caring measure on other worlds reduces your caring measure on your own.
Anything structurally similar in your framework?
By nosy preferences, do you mean something like this?
“I am grateful to Zeus for telling me that cows have feelings. Now I know that, even if Zeus had told me that cows are unfeeling brutes, eating them would still be wrong.”
But that just seems irrational and not worth modeling. Or do you have some other kind of situation in mind?
Pretty much that, actually. It doesn’t seem too irrational, though. Upon looking at a mathematical universe where torture was decided upon as a good thing, it isn’t an obvious failure of rationality to hope that a cosmic ray flips the sign bit of the utility function of an agent in there.
The practical problem with values that care about other mathematical worlds, however, is that if the agent you built has a UDT prior over values, it’s an improvement (from the perspective of the prior) for the nosy neighbors, the values that care about other worlds, to dictate some of what happens in your world, since the marginal contribution of your world to the prior expected utility looks like some linear combination of the various utility functions, weighted by how much they care about your world. So, in practice, it’d be a bad idea to build a UDT value learning prior containing utility functions that have preferences over all worlds, since it’d add a bunch of extra junk from different utility functions to our world if run.
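To make the “linear combination” point concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration only: the two value hypotheses, the 50/50 prior, and the `care` tables (each hypothesis splits a total caring measure of 1 across worlds). It just computes the coefficient each utility function gets in a given world's objective.

```python
# Hypothetical toy model: two value hypotheses and two worlds.
# PRIOR is an assumed UDT prior over which values are "right".
PRIOR = {"veg": 0.5, "carn": 0.5}

# care[v][w]: share of hypothesis v's caring measure placed on world w.
# Each row sums to 1 (assumption: caring measure is normalized).
NON_NOSY = {
    "veg": {"veg_world": 1.0, "carn_world": 0.0},
    "carn": {"veg_world": 0.0, "carn_world": 1.0},
}
NOSY_VEG = {
    "veg": {"veg_world": 0.5, "carn_world": 0.5},  # veg also cares about carn_world
    "carn": {"veg_world": 0.0, "carn_world": 1.0},
}


def world_weights(prior, care, world):
    """Coefficient on each utility function in `world`'s share of prior expected utility."""
    return {v: prior[v] * care[v][world] for v in prior}


print(world_weights(PRIOR, NON_NOSY, "carn_world"))  # {'veg': 0.0, 'carn': 0.5}
print(world_weights(PRIOR, NOSY_VEG, "carn_world"))  # {'veg': 0.25, 'carn': 0.5}
print(world_weights(PRIOR, NOSY_VEG, "veg_world"))   # {'veg': 0.25, 'carn': 0.0}
```

In this toy setup, the nosy vegetarian hypothesis gains a weight of 0.25 in carn-world (the “extra junk”), but only by giving up half of its weight in veg-world, which is the balancing behavior mentioned earlier in the thread.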
Are you talking about something like this?
“I’m grateful to HAL for telling me that cows have feelings. Now I’m pretty sure that, even if HAL had a glitch and mistakenly told me that cows are devoid of feeling, eating them would still be wrong.”
That’s valid reasoning. The right way to formalize it is to have two worlds, one where eating cows is okay and another where eating cows is not okay, without any “nosy preferences”. Then you receive probabilistic evidence about which world you’re in, and deal with it in the usual way.
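As a rough sketch of that two-world formalization, the HAL example reduces to an ordinary Bayesian update. The prior and the glitch rate below are made-up numbers, and `posterior_cows_feel` is a hypothetical helper, not anything from the discussion above.

```python
def posterior_cows_feel(prior_feel, p_report_if_feel, p_report_if_not):
    """P(cows feel | HAL reports 'cows feel'), by Bayes' rule."""
    joint_feel = prior_feel * p_report_if_feel
    joint_not = (1.0 - prior_feel) * p_report_if_not
    return joint_feel / (joint_feel + joint_not)


# Assumed numbers: 50/50 prior over the two worlds, and HAL glitches
# (reports the opposite of the truth) 10% of the time.
p = posterior_cows_feel(prior_feel=0.5, p_report_if_feel=0.9, p_report_if_not=0.1)
print(round(p, 3))  # 0.9

# Had HAL glitched and reported "cows are unfeeling", the same rule applies
# with the complementary likelihoods; no cross-world preferences are needed.
p_glitch = posterior_cows_feel(prior_feel=0.5, p_report_if_feel=0.1, p_report_if_not=0.9)
print(round(p_glitch, 3))  # 0.1
```

Each world's values only score outcomes in that world; the report just shifts the probability of being in one world or the other.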
I’m not clear on whether it is rational or not. It seems like behavior we don’t want from a value learner, but I was curious about how “inevitable” it is in attempts to mix updatelessness with value learning. (Perhaps it is a really simple point, but I still haven’t thought it through entirely.)
I have a recent result about value learning in UDT; it turns out to work very nicely and doesn’t suffer from the problem you describe.