UDT, in its global policy form, is trying to solve two problems:
(1) coordination between the instances of an agent faced with
alternative environments; and
(2) not losing interest in counterfactuals as soon as observations
contradict them.
I think that in practice, UDT is a wrong approach to problem (1),
and the way in which it solves problem (2) obscures the nature of that
problem.
Coordination, achieved with UDT, is like using identical agents to get
cooperation in the Prisoner’s Dilemma (PD).
Already in simple use cases, different instances of the UDT agent have
different amounts of computational resources, which can make their
decision processes diverge; hence the workarounds of keeping track of
how much computation each decision process is given, so that
coordination doesn’t break, or of building hierarchies of decision
processes that can access more and more resources.
Even worse, the instances may be working on different problems and have
no need to coordinate at the level of the computational resources those
problems require.
But we know that cooperation is possible in much greater generality,
even between unrelated agents, and I think this is the right way of
handling the differences between the instances of an agent.
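To make the “identical agents in PD” point concrete, here is a
deliberately toy sketch (the payoff numbers are just the standard PD
ordering, and everything else is illustrative): two copies of the same
decision procedure each reason that, because the counterpart runs
exactly the same code, only the symmetric outcomes (C, C) and (D, D)
are reachable, and they pick the better of the two.

```python
# Toy one-shot Prisoner's Dilemma with the standard payoff ordering
# (these particular numbers are illustrative, not from the post).
PAYOFF = {  # (my move, their move) -> my payoff
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}

def symmetric_agent() -> str:
    # The counterpart is an exact copy of this procedure, so whatever
    # this instance outputs, the other outputs too: only symmetric
    # outcomes are treated as reachable.
    reachable = [("C", "C"), ("D", "D")]
    best = max(reachable, key=lambda outcome: PAYOFF[outcome])
    return best[0]

# Both "instances" are literally the same procedure, so the symmetry
# premise holds by construction and they land on mutual cooperation.
a, b = symmetric_agent(), symmetric_agent()
print(a, b, PAYOFF[(a, b)])  # -> C C 3
```

The argument only goes through because the two instances are exact
copies; as soon as their computations can diverge, the symmetry
premise, and with it this style of coordination, is in question.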
It’s useful to restate the problem of not ignoring counterfactuals as
a problem of preserving values.
It’s not quite reflective stability, as it’s stability under external
observations rather than reflection, but when an agent plans for
future observations it can change itself to preserve its values when the
observations happen (hence “Son of CDT” that one-boxes).
One issue is that the resulting values are still not right: they
ignore counterfactuals that are not in the future of where the
self-modification took place, and it’s even less clear how
self-modification addresses computational uncertainty.
So the problem is not just preserving values, but formulating them in the
first place so that they can already talk about counterfactuals and
computational resources.
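To spell out the “Son of CDT that one-boxes” step, here is a toy worked
example (illustrative only; the dollar amounts are the conventional
Newcomb ones): a CDT agent that self-modifies now installs a one-boxing
policy for any Newcomb problem whose prediction lies in its future,
since the installed policy causally determines that prediction; but for
a prediction already made before the self-modification, the installed
policy changes nothing causally, and two-boxing still dominates. The
counterfactuals behind the self-modification are still treated the old
way.

```python
def newcomb_payoff(policy: str, prediction: str) -> int:
    # Opaque box holds 1,000,000 iff one-boxing was predicted; the
    # transparent box always holds 1,000 (the conventional amounts).
    opaque = 1_000_000 if prediction == "one-box" else 0
    return opaque if policy == "one-box" else opaque + 1_000

POLICIES = ("one-box", "two-box")

# Prediction in the future of the self-modification: the predictor reads
# off whatever policy is installed now, so installing one-boxing
# causally yields the million.
future_case = {p: newcomb_payoff(p, prediction=p) for p in POLICIES}

# Prediction already made: the policy installed now cannot causally
# change it, and two-boxing is worth 1,000 more under either fixed
# prediction.
past_case = {fixed: {p: newcomb_payoff(p, fixed) for p in POLICIES}
             for fixed in POLICIES}

print(future_case)  # {'one-box': 1000000, 'two-box': 1000}
print(past_case)    # two-boxing beats one-boxing by 1,000 either way
```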
I think that, to a first approximation, the thing in common between
instances of an agent (within a world, between alternative worlds,
and at different times) should be a fixed definition of values, while the
decision algorithms should be allowed to be different and to coordinate
with each other as unrelated agents would.
This requires an explanation of what kind of thing values are,
their semantics, so that the same values
(1) can be interpreted in unrelated situations to guide decisions, including
worlds that don’t have our physical laws, and by agents that don’t know the
physical laws of the situations they inhabit, but
(2) retain valuation of all the other situations, which should in particular
motivate acausal coordination as an instrumental drive.
Each of these points is relatively straightforward to address on its
own, but not both together.
I’m utterly confused about this problem, and I think it deserves more
attention.
It seems to me like cooperation might be possible in much greater generality. I don’t see how we know that it is possible. Please explain?
I’m having trouble following you here. Can you explain more about each point, and how they can be addressed separately?