The paradox arises because the action-optimal formula mixes world states and belief states. The [action-optimal] formula essentially starts by summing the contributions of the individual nodes as if you were an “outside” observer who knows where you are, but then calculates the probabilities at the nodes as if you were an absent-minded “inside” observer who merely believes, to some degree, that they are there.
The probabilities you’re summing are therefore apples and oranges, so it’s no wonder the result doesn’t make sense. As stated, the action-optimal formula is a bit like looking into your wallet more often and then counting the same money more often. Seeing the same 10 dollars twice isn’t the same thing as owning 20 dollars.
If you want to calculate the utility and the optimal decision probability entirely in belief space (i.e. action-optimally), then you need to take into account that you can be at X while already knowing that, when you’re at Y, you’ll consider the possibility of being at X all over again.
So in belief space, the formula for the expected value also needs to account for the fact that you’ll forget, which makes it recursive. Writing α for your belief that you’re currently at X (so 1−α for Y) and p for the probability of choosing CONTINUE, the formula should actually be: E = αp×E + α(1−p)×0 + (1−α)p×1 + (1−α)(1−p)×4
Explanation of the terms in order of appearance:
- If we are in X and CONTINUE, then we will “expect the same value again” when we are in Y in the future. This enforces temporal consistency.
- If we are in X and EXIT, then we should expect 0 utility.
- If we are in Y and CONTINUE, then we should expect 1 utility.
- If we are in Y and EXIT, then we should expect 4 utility.

We also know that α must be 1 / (1 + p): when driving n times, you’re in X n times and in Y p×n times, so the fraction of visits that happen at X is n / (n + p×n) = 1 / (1 + p).
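Spelling out the algebra (my addition; the original comment jumps straight to the result): substituting α = 1/(1+p), so that 1−α = p/(1+p), into the recursion and solving for E gives

```latex
\begin{aligned}
E &= \alpha p E + (1-\alpha)\,p + 4(1-\alpha)(1-p) \\
E\Bigl(1 - \frac{p}{1+p}\Bigr) &= \frac{p}{1+p}\,\bigl(p + 4(1-p)\bigr) \\
\frac{E}{1+p} &= \frac{p\,(4-3p)}{1+p} \\
E &= -3p^2 + 4p
\end{aligned}
```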
Under that constraint, we get E = −3p² + 4p. The optimum is at p = 2/3 with an expected utility of 4/3, which matches the planning-optimal formula.
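As a sanity check (my addition, not part of the quoted comment), here is a minimal Python simulation of the driver in plain world-state terms, using the payoffs above (EXIT at X → 0, EXIT at Y → 4, CONTINUE past Y → 1). The estimates should track the closed form E = −3p² + 4p:

```python
import random

def simulate(p, trials=200_000):
    """Estimate the driver's expected utility when CONTINUE has probability p.

    Payoffs from the text: EXIT at X -> 0, EXIT at Y -> 4, CONTINUE past Y -> 1.
    """
    total = 0
    for _ in range(trials):
        if random.random() < p:        # CONTINUE at X, arrive at Y
            if random.random() < p:    # CONTINUE at Y, drive past the exit
                total += 1
            else:                      # EXIT at Y
                total += 4
        # EXIT at X contributes 0, so nothing is added
    return total / trials

for p in (0.5, 2/3, 0.9):
    print(f"p = {p:.3f}: simulated {simulate(p):.3f}, closed form {4*p - 3*p*p:.3f}")
```

At p = 2/3 both numbers should land near 4/3 ≈ 1.333, the planning-optimal value.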
[Shamelessly copied from a comment under this video by xil12323.]