Could you explain what the monotonicity principle is, without referring to any symbols or operators?
The loss function of a physicalist agent depends on which computational facts are physically manifest (roughly speaking, which computations the universe runs), and on the computational reality itself (the outputs of computations). The monotonicity principle requires the loss to be non-decreasing as fewer facts are manifest. Roughly speaking, the more computations the universe runs, the better.
This is odd, because it implies that the total destruction of the universe is always the worst possible outcome, and that the creation of an additional, causally disconnected world can never be net-negative. For a monotonic agent, there can be no net-negative world[1]. In particular, for selfish monotonic agents (i.e. agents that assign value only to their own observations), this means death is the worst possible outcome and the creation of additional copies of the agent can never be net-negative.
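For concreteness, here is a minimal sketch of what the constraint looks like if we pretend a “world” is just the set of programs the universe runs (the encoding and the example loss are my own toy choices, not notation from the post; the entanglement subtlety mentioned in a footnote further down is ignored):

```python
# Toy sketch (my own encoding, not the post's formalism): summarize a "world"
# by the set of programs the universe runs. The monotonicity principle then
# says that running strictly more programs can never increase the loss.
from itertools import combinations

def is_monotone(loss, programs):
    """True iff A ⊆ B implies loss(B) <= loss(A) for all subsets A, B."""
    subsets = [frozenset(c) for r in range(len(programs) + 1)
               for c in combinations(sorted(programs), r)]
    return all(loss(b) <= loss(a) for a in subsets for b in subsets if a <= b)

# A selfish agent that only cares whether its own experience runs:
# the empty world ("total destruction", nothing runs) is the worst case,
# and adding unrelated computations never hurts.
programs = {"my_experience", "unrelated_computation"}
selfish_loss = lambda world: 0.0 if "my_experience" in world else 1.0
print(is_monotone(selfish_loss, programs))  # True
```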
With all the new notation, I forgot what everything meant after the first time they were defined.
That being said, I appreciate all the work you put into this. I can tell there’s important stuff to glean here.
Well, there are the “notation” and “notation reference” subsections, that might help.
Thank you!
[1] At least, all of this is true if we ignore the dependence of the loss function on the other argument, namely the outputs of computations. But it seems like that doesn’t qualitatively change the picture.
Thank you for explaining this! But then how can this framework be used to model humans as agents? People can easily imagine outcomes worse than death or destruction of the universe.
The short answer is, I don’t know.
The long answer is, here are some possibilities, roughly ordered from “boring” to “weird”:
1. The framework is wrong.
2. The framework is incomplete: there is some extension which gets rid of monotonicity. There are some obvious ways to make such extensions, but they look uglier, and without further research it’s hard to say whether they break important things or not.
3. Humans are just not physicalist agents; you’re not supposed to model them using this framework, even if this framework can be useful for AI. This is why it took humans so much time to come up with science.
4. Like #3, and also if we thought long enough we would become convinced of some kind of simulation/deity hypothesis (where the simulator/deity is a physicalist), and this is normatively correct for us.
5. Because the universe is effectively finite (since it’s asymptotically de Sitter), there are only so many computations that can run. Therefore, even if you only assign positive value to running certain computations, this effectively implies that running other computations is bad (they crowd out computations you do value). Moreover, the fact that the universe is finite is unsurprising, since infinite universes tend to have all possible computations running, which makes them roughly irrelevant hypotheses for a physicalist.
6. We are just confused about hell being worse than death. For example, maybe people in hell have no qualia. This makes some sense if you endorse the (natural for physicalists) anthropic theory that only the best-off future copy of you matters. You can imagine there always being a “dead copy” of you, so that if something worse-than-death happens to the apparent-you, your subjective experiences go into the “dead copy”.
The monotonicity principle requires the loss to be non-decreasing as fewer facts are manifest. Roughly speaking, the more computations the universe runs, the better.
I think this is what I was missing. Thanks.
So, then, the monotonicity principle sets a baseline for the agent’s loss function that corresponds to how much less stuff can happen to whatever subset of the universe it cares about, getting worse the fewer opportunities become available, due to death or some other kind of stifling. Then the agent’s particular value function over universe-states gets added/subtracted on top of that, correct?
No, it’s not a baseline, it’s just an inequality. Let’s do a simple example. Suppose the agent is selfish and cares only about (i) the experience of being in a red room and (ii) the experience of being in a green room. And let’s suppose these are the only two possible experiences: it can’t experience going from a room of one color to a room of another color, or anything like that (for example, because the agent has no memory). Denote by G the program corresponding to “the agent deciding on an action after it sees a green room” and by R the program corresponding to “the agent deciding on an action after it sees a red room”. Then, roughly speaking[1], there are four possibilities:
α∅: The universe runs neither R nor G.
αR: The universe runs R but not G.
αG: The universe runs G but not R.
αRG: The universe runs both R and G.
In this case, the monotonicity principle imposes the following inequalities on the loss function L:
L(α∅) ≥ L(αR),  L(α∅) ≥ L(αG),  L(αR) ≥ L(αRG),  L(αG) ≥ L(αRG)
That is, α∅ must be the worst case and αRG must be the best case.
[1] In fact, manifesting of computational facts doesn’t amount to selecting a set of realized programs, because programs can be entangled with each other, but let’s ignore this for simplicity’s sake.
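To make the inequalities concrete, here is a toy check (the numeric loss values are my own and purely illustrative) that an assignment of losses to the four possibilities above satisfies exactly these constraints:

```python
# Purely illustrative numbers for the four cases above; the monotonicity
# principle only demands the four listed inequalities, nothing more.
loss = {
    frozenset():           1.0,  # α∅ : neither R nor G runs (must be worst)
    frozenset({"R"}):      0.7,  # αR : R runs but not G
    frozenset({"G"}):      0.4,  # αG : G runs but not R
    frozenset({"R", "G"}): 0.4,  # αRG: both run (must be best; ties allowed)
}

# Check: manifesting fewer facts can only increase the loss.
assert all(loss[b] <= loss[a] for a in loss for b in loss if a <= b)
print("all inequalities hold")
```

Within these inequalities the agent is still free to rank αR against αG however it likes; monotonicity only forbids strictly preferring that a computation not run.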
Okay, so it’s just a constraint on the final shape of the loss function. Would you construct such a loss function by integrating a strictly non-positive computation-value function over all of space and time (or at least over the future light-cones of all its copies, if it focuses just on the effects of its own behavior)?
Space and time are not really the right parameters here, since these refer to Φ (physical states), not Γ (computational “states”) or 2^Γ (physically manifest facts about computations). In the example above, it doesn’t matter where the (copy of the) agent is when it sees the red room, only the fact that the agent does see it. We could construct such a loss function by a sum over programs, but the constructions suggested in section 3 use a minimum instead of a sum, since this seems like a less “extreme” choice in some sense. Of course, ultimately the loss function is subjective: as long as the monotonicity principle is obeyed, the agent is free to have any loss function.
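As a rough sketch of the two shapes of construction mentioned here (the names WORST and per_program and the numbers are my own toy choices, not the actual definitions from section 3), both a sum of non-positive contributions and a minimum over per-program losses automatically satisfy the monotonicity principle:

```python
# Rough sketch (toy names and numbers of my own, not the actual constructions
# from section 3): two ways to turn per-program losses into a loss over sets
# of manifest programs, both of which satisfy the monotonicity principle.

WORST = 1.0  # loss when nothing the agent cares about runs

# "how bad is it if this is the best program the universe runs?"
per_program = {"R": 0.7, "G": 0.4}

def loss_min(world):
    # minimum over manifest programs: adding programs can only lower the loss,
    # and only the best manifest program matters
    return min((per_program[p] for p in world if p in per_program), default=WORST)

def loss_sum(world):
    # sum of non-positive contributions on top of the worst case: also monotone,
    # but every manifest program the agent cares about shifts the total
    return WORST + sum(per_program[p] - WORST for p in world if p in per_program)

for world in [set(), {"R"}, {"G"}, {"R", "G"}]:
    print(sorted(world), loss_min(world), round(loss_sum(world), 2))
```

In the min-based sketch only the best manifest program matters, whereas in the sum-based sketch every cared-about program shifts the total; that difference is one way to read the remark about min being a less “extreme” choice.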