Notice that learning-UDT implies UDT: an agent eventually behaves as if it were applying UDT with each Pn. Therefore, in particular, it eventually behaves like UDT with prior P0. So (with the exception of some early behavior which might not conform to UDT at all) this is basically UDT with a prior which allows for learning. The prior P0 is required to eventually agree with the recommendations of P1, P2, … (which also implies that these eventually agree with each other).
I don’t understand this argument.
“an agent eventually behaves as if it were applying UDT with each Pn” — why can’t an agent skip over some Pn entirely or get stuck on P9 or whatever?
“Therefore, in particular, it eventually behaves like UDT with prior P0.” Even granting the above — sure, it will behave like UDT with prior P0 at some point. But then after that it might have some other prior. Why would it stick with P0?
I probably need to clarify the statement of the assumption.
The idea isn’t that, for each n, the agent eventually takes at least one action in line with Pn and might then do other stuff. The idea is that for each n, there is a time tn after which all decisions will be consistent with UDT-using-Pn.
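To sketch the intended quantifier structure (the notation here is mine, not from the original: $a_t$ is the decision made at time $t$, and $\mathrm{UDT}[P_n](t)$ is the set of decisions that UDT with prior $P_n$ would endorse at time $t$):

$$\forall n \;\; \exists t_n \;\; \forall t \ge t_n : \quad a_t \in \mathrm{UDT}[P_n](t)$$

On this reading, the $\forall t \ge t_n$ is what answers the "why would it stick with P0" worry: after t0, every decision is consistent with UDT-using-P0, not just one of them. And since all decisions after max(tn, tm) must be consistent with both Pn and Pm, the priors are forced to eventually agree with each other, as noted above.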
UDT’s recommendations will often coincide with those of more-updateful DTs. So the learning-UDT assumption is saying that UDT eventually behaves in an updateful way with respect to each observation, although not necessarily right away upon receiving that observation.