I have an idea for reasoning about counterpossibles for decision theory. I’m pretty skeptical that it’s correct, because it doesn’t seem that hard to come up with. Still, I can’t see a problem with it, and I would very much appreciate feedback.
This paper provides a method of describing UDT using proof-based counterpossibles. However, it doesn’t work in stochastic environments. I will describe a new system intended to fix this. The technique seems straightforward enough to come up with that I suspect I’m either doing something wrong or this has already been thought of, so I’m interested in feedback.
In the system described in the paper, the algorithm checks whether Peano Arithmetic proves that the agent outputting action a would result in the environment reaching outcome o, and then picks an action whose provable outcome has utility at least as high as that of every other action’s provable outcome.
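To make that concrete, here is a minimal sketch of my reading of the paper’s algorithm, in the same Python-flavored pseudocode style as the code later in this post. The names pa_proves, actions, outcomes, and utility_of are placeholders I’m introducing for illustration; pa_proves stands in for some bounded proof search in Peano Arithmetic.

def proof_based_udt(actions, outcomes, utility_of, pa_proves):
    # For each action, look for a provable implication of the form
    # "the agent outputs a -> the environment reaches outcome o".
    provable = {}
    for a in actions:
        for o in outcomes:
            if pa_proves(f"A() = {a} implies E() = {o}"):
                provable[a] = utility_of(o)
    # Return an action whose provable outcome has maximal utility.
    return max(provable, key=provable.get)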
My proposed modification is to instead fix a system for estimating the expected utility conditional on the agent taking action a, and then, for every utility u, try to prove that the estimation system would output u as the agent’s expected utility. Then take the action that maximizes the provable expected-utility estimates of the estimation system.
I will now provide more detail on the estimation system. I remember reading about an extension of Solomonoff induction that allows it to access halting oracles. This isn’t computable, so instead imagine a system that uses some approximation of that extension in which logical induction, or some more powerful technique, is used to approximate the halting oracles, with one exception. The exception is the answer to the logical question “my program, in the current circumstances, outputs x”, which would be taken to be true whenever the AI is considering the implications of taking action x. Expected utility can then be calculated using the probability estimates provided by the system.
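Here is a rough sketch, under heavy assumptions, of how that estimation system might be organized. Nothing here is a real implementation: logical_inductor, its probability method, and world_model.probability_of_utility are hypothetical placeholders for the approximate induction machinery described above.

def approximate_expected_utility(world_model, action, logical_inductor, utilities):
    # Oracle-style queries are answered by the logical inductor, EXCEPT the query
    # about the agent's own output, which is fixed to the action under consideration.
    # (logical_inductor and world_model expose hypothetical interfaces; see above.)
    def oracle(query):
        if query == f"my program, in the current circumstances, outputs {action}":
            return 1.0  # taken to be true while considering this action
        return logical_inductor.probability(query)
    # Expected utility under the resulting distribution over utilities.
    return sum(u * world_model.probability_of_utility(u, oracle) for u in utilities)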
Now, I’ll describe it in code. Let |E()| represent a Gödel encoding of the function describing the AI’s world model and |A()| represent a Gödel encoding of the agent’s output. Let approximate_expected_utility(|E()|, a) be some algorithm that computes some reasonable approximation of the expected utility after conditioning on the agent taking action a. Let ^x^ represent a dequote of x. Let eeus be a dictionary (of estimated expected utilities). Here I’m assuming there are finitely many possible utilities.
function UDT(|E()|, |A()|):
    eeus = {}
    for utility in utilities:
        for action in actions: # actions are Gödel-encoded
            if PA proves |approximate_expected_utility(|E()|, |A()| = ^action^)| = utility:
                eeus[action] = utility
    return the key in eeus that maximizes eeus[key]
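As a usage sketch (my own toy numbers): with two actions a1 and a2 and utilities {0, 0.5, 1}, if PA proves the estimator outputs 0.5 when conditioning on a1 and 1 when conditioning on a2, then eeus ends up as {a1: 0.5, a2: 1} and UDT returns a2.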
This gets around the problem with the original algorithm: it couldn’t prove anything about the utility in a world with indexical uncertainty, so my system instead proves something about a fixed probabilistic approximation of the utility.
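As a toy illustration of that point (my own example, not from the paper): consider an environment that pays 1 if a fair coin lands heads and 0 otherwise, regardless of the action. PA can’t prove “A() = a implies U() = u” for any particular u, so the original algorithm gets stuck; but it can verify that a fixed estimator program outputs 0.5, which is all my algorithm needs.

def coin_flip_utility(action, coin):
    # Pays 1 on heads, 0 on tails; the action is irrelevant.
    return 1 if coin == "heads" else 0

def toy_approximate_expected_utility(world_model_encoding, action):
    # A fixed, deterministic program: PA can prove this returns 0.5,
    # even though no single utility value is provable for the environment itself.
    return 0.5 * coin_flip_utility(action, "heads") + 0.5 * coin_flip_utility(action, "tails")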
Note that this still doesn’t specify a method of reasoning about counterpossibles concerning what would happen if an agent took a certain action when it clearly wouldn’t. For example, if an agent has a decision algorithm of “output a, unconditionally”, then this doesn’t provide a method of explaining what would happen if it outputted something else. The paper listed this as a concern about the method it provided, too. However, I don’t see why it matters. If an agent has the decision algorithm “action = a”, then what’s even the point of considering what would happen if it outputted b? It’s not like it’s ever going to happen.