If by “coinciding for decisions that are in the support” you mean what I think you mean, then that’s true re: actions that never happen, but it’s not clear why actions that never happen should influence your assessment of how a decision theory works. Implicitly, whenever you do anything probabilistic, you assume that sets of null measure can be thrown away without changing anything.
The issue is that you need to actually condition on the actions that never happen in order to decide what their expected utility would be, and you need that expected utility to decide not to take them.
I don’t think this is a real-world problem, because you can do some kind of relaxation: add random noise to your actions, then let the standard deviation go to zero. In practice there aren’t perfectly deterministic systems anyway.
It’s likely that some strategy like that also works in theory and has already been worked out by someone. In any event, it doesn’t seem like a serious obstacle unless the “renormalization” ends up depending on which procedure you pick, which seems unlikely.
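To make the relaxation concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the action names `take_5` and `take_10`, the fixed payoffs, and the "tremble with probability `eps`" noise model. It only shows that the conditional expected utility of a never-taken action is undefined at `eps = 0` and becomes well defined for any `eps > 0`.

```python
from fractions import Fraction

# Hypothetical payoffs for a 5-and-10 style choice (illustrative assumption).
PAYOFF = {"take_5": Fraction(5), "take_10": Fraction(10)}


def joint(eps):
    """Joint P(action, utility) for a policy that intends take_10
    but trembles to take_5 with probability eps."""
    return {
        ("take_10", PAYOFF["take_10"]): 1 - eps,
        ("take_5", PAYOFF["take_5"]): eps,
    }


def conditional_eu(dist, action):
    """E[U | action] = sum_u u * P(action, u) / P(action)."""
    p_action = sum(p for (a, _), p in dist.items() if a == action)
    if p_action == 0:
        # Conditioning on a null event: the quotient is 0/0.
        raise ZeroDivisionError(f"P({action}) = 0, so E[U | {action}] is undefined")
    return sum(u * p for (a, u), p in dist.items() if a == action) / p_action


def eu_along_noise(action, epsilons):
    """Evaluate E[U | action] for a shrinking sequence of noise levels."""
    return [conditional_eu(joint(eps), action) for eps in epsilons]
```

In this toy model the value of `eu_along_noise("take_5", ...)` does not depend on `eps` at all, so the limit is trivially well defined. Whether that independence survives in a model where utility depends on the agent's beliefs about itself is the "renormalization" worry above, and this sketch does not settle it.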
This is called epsilon-exploration in RL.
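For readers who haven't seen it, a minimal sketch of epsilon-greedy action selection, the usual form of epsilon-exploration. The function name `epsilon_greedy` and the list-of-estimates representation are assumptions for illustration, not any particular library's API.

```python
import random


def epsilon_greedy(q_values, eps, rng):
    """With probability eps pick a uniformly random action,
    otherwise pick the action with the highest estimated value."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Usage: every action keeps nonzero probability as long as eps > 0.
rng = random.Random(0)
action = epsilon_greedy([0.1, 0.7, 0.3], eps=0.1, rng=rng)
```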
I think epsilon-exploration is done for different reasons, but there are a bunch of cases in which “add some noise and then let the noise go to zero” is a viable strategy for solving problems. Here it’s done mainly to sidestep an issue of “dividing by zero”, which makes me think that there’s some kind of argument which sidesteps it by using limits or something like that. It feels similar to differentiating a function, where plugging in a step size of zero gives 0/0 and you take a limit instead.
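Spelled out, the analogy might look like the following. The notation is an assumption: $A$ is the action, $U$ the utility, and $P_\varepsilon$ the distribution under a noisy policy in which every action has positive probability.

```latex
% Difference quotient: 0/0 at h = 0, defined through a limit.
f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}

% Conditional expected utility: 0/0 when P(A = a) = 0.
\mathbb{E}[U \mid A = a] = \frac{\mathbb{E}[U \cdot \mathbf{1}_{A = a}]}{P(A = a)}

% Assumed regularization: with P_\varepsilon(A = a) > 0 for every \varepsilon > 0,
\mathbb{E}[U \mid A = a] := \lim_{\varepsilon \to 0^{+}}
  \frac{\mathbb{E}_{\varepsilon}[U \cdot \mathbf{1}_{A = a}]}{P_{\varepsilon}(A = a)}
```

Unlike the derivative, nothing here guarantees that the limit exists or that it is the same for every choice of noise family; that would have to be argued separately.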
The RL case is different and is more reminiscent of e.g. simulated annealing, where adding noise to an optimization procedure and letting the noise tend to zero over time improves performance compared to a more greedy approach. I don’t think these are quite the same thing as what’s happening with the EDT situation here; it seems to me like an application of the same technique for quite different purposes.
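For contrast, a minimal sketch of the annealing use of noise. The landscape, the linear cooling schedule, and the function names are all illustrative assumptions. Here the noise is there to escape local minima, not to make an undefined quantity defined.

```python
import math
import random

# Toy 1-D landscape (assumed): local minimum at index 1, global minimum at index 7.
LANDSCAPE = [5, 3, 4, 6, 7, 6, 2, 0, 1, 4]


def cost(i):
    return LANDSCAPE[i]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


def greedy(f, x0, nbrs):
    """Move to the best neighbor until no neighbor improves."""
    x = x0
    while True:
        nxt = min(nbrs(x), key=f)
        if f(nxt) >= f(x):
            return x
        x = nxt


def anneal(f, x0, nbrs, steps=2000, t0=1.0, seed=0):
    """Metropolis-style search whose noise (temperature) shrinks toward zero."""
    rng = random.Random(seed)
    x, best = x0, x0
    for k in range(steps):
        temp = t0 * (1 - k / steps)  # linear cooling; stays positive inside the loop
        cand = rng.choice(nbrs(x))
        delta = f(cand) - f(x)
        # Always accept improvements; accept worse moves with probability exp(-delta/temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            x = cand
        if f(x) < f(best):
            best = x
    return best
```

Starting from index 0, `greedy` stops at the local minimum at index 1. `anneal` can accept uphill moves while the temperature is high, so it has a chance of crossing the barrier, and it never returns anything worse than what greedy finds from this start.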
Here’s my attempt at sidestepping: EDT solves 5 and 10 with conditional oracles.