Unfortunately, on-policy expected information gain goes to 0 pretty fast (Theorem 5 here).
Where’s the “pretty fast”? The theorem makes a claim in the limit but says nothing about the rate of convergence. (I haven’t read the rest of the paper.)
Oh yeah, sorry, that isn’t shown there. But I believe the sum over all timesteps of the m-step expected info gain at each timestep is finite w.p. 1, which would make it o(1/t) w.p. 1.
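To spell out that step: summability alone only forces the per-step gain to be o(1/t) along a subsequence, so here is a minimal sketch under the extra assumption (mine, not the paper’s) that the expected info gain a_t ≥ 0 is eventually nonincreasing:

$$
\frac{t}{2}\, a_t \;\le\; \sum_{k=\lceil t/2 \rceil}^{t} a_k \;\longrightarrow\; 0 \quad (t \to \infty),
$$

since the middle sum is a tail segment of a convergent series; hence t·a_t → 0, i.e. a_t = o(1/t). Applied pathwise on the probability-1 event where the sum is finite, this gives the w.p. 1 statement.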