Yes, most of the algorithms in use today are known to converge, or roughly converge, to optimizing per-episode reward. In most cases it’s relatively clear that the outer optimizer does no optimization across episode boundaries.
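To make the episode-boundary point concrete, here is a minimal sketch of how a return-based training target is commonly computed. It is an illustration, not any particular algorithm's implementation: the function name `per_episode_returns` and the `rewards`/`dones`/`gamma` arguments are hypothetical, and it assumes a flat batch of steps in which `dones[t]` marks the last step of an episode.

```python
def per_episode_returns(rewards, dones, gamma=0.99):
    """Discounted returns-to-go that reset at every episode boundary.

    rewards[t] is the reward at step t; dones[t] is True when step t is the
    final step of its episode (an assumption of this sketch).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates the rewards that follow it.
    for t in reversed(range(len(rewards))):
        if dones[t]:
            # Rewards from the next episode are dropped here, so they never
            # enter the training target for this episode's actions.
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# The large reward in the second episode does not leak into the first one.
print(per_episode_returns([1, 1, 100], [False, True, False], gamma=1.0))
```

Because the running sum is zeroed at each boundary, an action's target depends only on rewards from its own episode, which is the sense in which the outer optimizer is not pushing toward cross-episode outcomes.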