TurnTrout comments on Mode collapse in RL may be fueled by the update equation

TurnTrout 26 Jun 2023 17:52 UTC
LW: 2 AF: 2
> The advantage definition itself is correct and non-oscillating… Oscillating or nonconvergent value estimation is not the cause of policy mode collapse.
The advantage is (IIUC) defined with respect to a given policy, so as the policy and its value estimate change during training, the advantage can oscillate and then cause mode collapse. I agree that a constant learning rate schedule is problematic, but note that ACTDE converges even with a constant learning rate schedule. So, I would indeed say that oscillating value estimation caused mode collapse in the toy example I gave?
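To make the claim concrete, here is a minimal sketch of the kind of toy setup I have in mind. Everything in it is an assumption for illustration: a one-state, two-action bandit with hypothetical rewards 1.0 and 0.5, tabular logits, a constant learning rate of 0.5, and actions visited in fixed alternation rather than sampled from the policy (to keep the run deterministic). The `use_actde` flag swaps the state-value baseline `v` for a per-action estimate `q[a]`, which is my simplified reading of the ACTDE-style advantage, not a reference implementation.

```python
import math

# Hypothetical toy rewards for a one-state, two-action bandit.
REWARDS = {"high": 1.0, "low": 0.5}


def softmax_prob(logits, action):
    """Probability of `action` under a softmax over the tabular logits."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {a: math.exp(l - m) for a, l in logits.items()}
    return exps[action] / sum(exps.values())


def run(use_actde, cycles=200, lr=0.5):
    """Alternate between the two actions with a constant learning rate.

    use_actde=False: advantage = r - v, with one shared value estimate v.
    use_actde=True:  advantage = r - q[a], with a per-action estimate q[a].
    """
    logits = {a: 0.0 for a in REWARDS}
    v = 0.0
    q = {a: 0.0 for a in REWARDS}
    for _ in range(cycles):
        for a, r in REWARDS.items():  # fixed visiting order (an assumption)
            baseline = q[a] if use_actde else v
            adv = r - baseline
            logits[a] += lr * adv  # logit update proportional to advantage
            if use_actde:
                q[a] += lr * adv   # q[a] settles at r, so adv shrinks to 0
            else:
                v += lr * adv      # v keeps bouncing between the two rewards
    return logits, softmax_prob(logits, "high")


if __name__ == "__main__":
    for flag in (False, True):
        logits, p_high = run(use_actde=flag)
        print(f"use_actde={flag}: logits={logits}, P(high)={p_high:.4f}")
```

In this sketch the shared `v` never settles: it keeps getting pulled up by the high-reward action and down by the low-reward one, so the high-reward action sees a positive advantage on every cycle and its logit grows without bound. With the per-action baseline, each `q[a]` converges to its own reward, the advantages decay geometrically, and the logits converge to finite values despite the constant learning rate.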