I probably need to clarify the statement of the assumption.
The idea isn’t that it eventually takes at least one action that’s in line with Pn for each n, but then, might do some other stuff. The idea is that for each n, there is a time tn after which all decisions will be consistent with UDT-using-Pn.
UDT’s recommendations will often coincide with more-updateful DTs. So the learning-UDT assumption is saying that UDT eventually behaves in an updateful way with respect to each observation, although not necessarily right away upon receiving that observation.
I probably need to clarify the statement of the assumption.
The idea isn’t that it eventually takes at least one action that’s in line with Pn for each n, but then, might do some other stuff. The idea is that for each n, there is a time tn after which all decisions will be consistent with UDT-using-Pn.
UDT’s recommendations will often coincide with more-updateful DTs. So the learning-UDT assumption is saying that UDT eventually behaves in an updateful way with respect to each observation, although not necessarily right away upon receiving that observation.