How do
“One of the problems here is that the impact penalty only looks at the value of VAR one turn ahead. In the DeepMind paper, they addressed similar issues by doing “inaction rollouts”. I’ll look at the more general situations of π0 rollouts: rollouts for any policy π0.”
and
“That’s the counterfactual situation, that zeroes out the impact penalty. What about the actual situation? Well, as we said before, A will be just doing ∅; so, as soon as π0 would produce anything different from ∅, the A becomes completely unrestrained again.”
fit together? In the special case where π0 is the inaction policy, I don’t understand how the trick would work.
They don’t fit together in that case; that’s addressed immediately after, in section 2.3.