I like the philosophical and strategic take here: let’s avoid wireheading, arbitrary reinforcement strength is risky[1], hopefully we can get some values-caring-about-human-stuff.
The ACTDE seems potentially a nice complement/alternative to entropy[2] regularisation for avoiding mode collapse (I haven’t evaluated deeply). I think you’re misdiagnosing a few things though.
Overall I think the section about oscillating advantage/value estimation is irrelevant (interesting, but unrelated), and I think you should point the finger less at PPO and advantage estimation per se and more at exploration at large. And you might want to flag that too much exploration/randomness can also be an issue!
Though note that ideally, once we actually know with confidence what is best, we should be near-greedy about it, rather than softmaxing! Say it was ‘ice cream’ vs ‘slap in the face’: softmaxing over that for eternity would accrue regret linearly in time, i.e. unboundedly. As it stands I think humanity is very far from being able to safely and aggressively (greedily) optimise really important things, but this is at least a consideration to keep in mind.
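To make the linear-regret point concrete, here is a tiny sketch (my own illustration, not anything from the post; the +1/-1 payoffs and temperature 1 are arbitrary assumptions): a fixed softmax over two known payoffs pays a constant per-step regret, so its cumulative regret grows linearly, whereas a near-greedy choice would pay essentially none.

```python
import numpy as np

# Two options with fixed, known payoffs: 'ice cream' (+1) vs 'slap in the face' (-1).
rewards = np.array([1.0, -1.0])

def softmax(x, temperature=1.0):
    z = np.exp(x / temperature)
    return z / z.sum()

probs = softmax(rewards)                        # fixed softmax policy over the known payoffs
per_step_regret = rewards.max() - probs @ rewards

for t in (10, 1_000, 100_000):
    print(f"cumulative softmax regret after {t:>7} steps: {t * per_step_regret:.1f}")
# Regret grows linearly in t; a (near-)greedy policy would incur (near-)zero regret here.
```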
Incidentally, KL divergence regularisation is not primarily for avoiding mode collapse AFAIK; it’s for approximate trust region constraints, which may incidentally help to avoid mode collapse by penalising large jumps away from initially-high-entropy policies. See the TRPO paper. Entropy regularisation directly addresses mode collapse.
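For concreteness, here is a generic PyTorch-style sketch (my own, not the post’s setup; the coefficient names are made up) of where the two terms sit in a policy-gradient loss: the KL term keeps the policy near a reference policy (a soft trust region), while the entropy bonus is the term that directly pushes against collapsing onto a single mode.

```python
import torch
import torch.nn.functional as F

def regularised_pg_loss(logits, ref_logits, actions, advantages,
                        kl_coef=0.1, ent_coef=0.01):
    """Illustrative policy-gradient loss with both regularisers."""
    log_probs = F.log_softmax(logits, dim=-1)
    ref_log_probs = F.log_softmax(ref_logits, dim=-1)
    probs = log_probs.exp()

    # vanilla policy-gradient term on the chosen actions
    chosen_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages * chosen_log_probs).mean()

    # KL(pi || pi_ref): a soft trust-region penalty on drifting from the reference policy
    kl = (probs * (log_probs - ref_log_probs)).sum(-1).mean()

    # entropy of pi: directly discourages collapsing onto one mode
    entropy = -(probs * log_probs).sum(-1).mean()

    return pg_loss + kl_coef * kl - ent_coef * entropy
```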
> this kind of failure happens by default in policy gradient methods.
It looks like you’re kind of agreeing here that value estimate oscillation isn’t the culprit? Again I think this is pretty standard, though the finger is usually not pointed at any particular value estimator or whatnot, but rather at the greediness of updating only on so-far-observed data, i.e. the exploration problem. The GLIE conditions[1] (Greedy in the Limit with Infinite Exploration) are a classic result. Hence the plethora of exploration techniques which are researched and employed in RL.
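For reference, the two GLIE conditions are roughly: every state-action pair is visited infinitely often, and the policy becomes greedy with respect to the learned action-values in the limit,

$$\lim_{k \to \infty} N_k(s,a) = \infty \quad \text{and} \quad \lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\big[a = \arg\max_{a'} Q_k(s,a')\big],$$

where $N_k(s,a)$ is the visit count and $Q_k$ the current action-value estimate.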
Techniques like confidence bounding[2] based on Hoeffding’s inequality, and Thompson sampling based on Bayesian posterior uncertainty, require more than a simple mean estimate (which is all that a value or advantage estimate is): typically they also need at least one spread/uncertainty estimate[3]. Entropy regularisation, epsilon ‘exploration’, intrinsic ‘curiosity’ rewards, value-of-information estimation and so on are all heuristics for engaging with exploration.
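As a minimal illustration of the “mean plus uncertainty” point, here is a standard UCB1-style bandit sketch (my own; `pull_arm` and the other names are hypothetical, not anything from the post): the Hoeffding-style bonus needs a per-arm visit count on top of the running mean.

```python
import math
import numpy as np

def ucb1(pull_arm, n_arms, n_steps, c=2.0):
    """UCB1-style selection: pick the arm maximising
    mean + sqrt(c * ln(t) / count), a Hoeffding-style upper confidence bound."""
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for t in range(1, n_steps + 1):
        if t <= n_arms:
            arm = t - 1                                   # pull each arm once to initialise
        else:
            bonus = np.sqrt(c * math.log(t) / counts)     # needs counts, not just means
            arm = int(np.argmax(means + bonus))
        r = pull_arm(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]      # running sample average
    return means, counts
```

Thompson sampling likewise needs a full posterior per arm (e.g. a Beta distribution for Bernoulli rewards), not just a point estimate of the mean.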
Epsilon exploration can get away without a spread estimate, but its GLIE guarantees only hold if there’s a decaying epsilon per state, which secretly smuggles in an uncertainty estimate (because you’re tracking the progress bar on each state somehow, which means you’re tracking how often you’ve seen it).
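A sketch of what that looks like in practice (an illustrative class of my own, assuming a schedule like eps(s) = 1/N(s)): the per-state visit counter is exactly the smuggled-in uncertainty estimate.

```python
from collections import defaultdict
import numpy as np

class PerStateEpsilonGreedy:
    """GLIE-style epsilon-greedy where epsilon decays per state as 1 / N(s)."""
    def __init__(self, n_actions, rng=None):
        self.n_actions = n_actions
        self.visits = defaultdict(int)          # N(s): the hidden uncertainty estimate
        self.rng = rng or np.random.default_rng()

    def act(self, state, q_values):
        self.visits[state] += 1
        eps = 1.0 / self.visits[state]          # decays only as this state is revisited
        if self.rng.random() < eps:
            return int(self.rng.integers(self.n_actions))
        return int(np.argmax(q_values))
```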
> Though note that ideally, once we actually know with confidence what is best, we should be near-greedy about it, rather than softmaxing!
I disagree. I don’t view reward/reinforcement as indicating what is “best” (from our perspective), but as chiseling decision-making circuitry into the AI (which may then decide what is “best” from its perspective). One way of putting a related point: I think that we don’t need to infinitely reinforce a line of reasoning in order to train an AI which reasons correctly.
(I want to check—does this response make sense to you? Happy to try explaining my intuition in another way.)
There’s also the issue of non-ergodic/nonstationary environments (if I try out breaking my leg to see what happens, I might not be able to try out other stuff later!), which defeat the GLIE conditions and can cause another kind of collapse. Actually behaving sufficiently entropically is risky in such environments, hence research into safe exploration.
> The problem is that this advantage can oscillate forever.
This is a pretty standard point in RL textbooks. But the culprit is the learning rate (which you set to 1 in the example, but you can construct a nonconverging case for any constant α)! The advantage definition itself is correct and non-oscillating; it’s the estimation of the expectation using a moving average which is (sometimes) at fault.
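A quick numeric sketch of that claim (my own toy, not the post’s example: a fixed 50/50 policy over rewards 1 and 0, so the true V^π is 0.5): with α = 1 the moving-average estimate just tracks the last reward and flips forever, with any constant α it keeps fluctuating, and with a decaying (Robbins-Monro) schedule like α_n = 1/n it converges.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_estimate(alpha_schedule, n_steps=10_000):
    """Estimate V^pi for a 2-armed bandit under a fixed 50/50 policy.

    Rewards are 1.0 for one arm and 0.0 for the other, so the true V^pi = 0.5.
    V is updated by the usual moving-average rule V <- V + alpha * (r - V).
    """
    v = 0.0
    history = []
    for n in range(1, n_steps + 1):
        r = 1.0 if rng.random() < 0.5 else 0.0
        v += alpha_schedule(n) * (r - v)
        history.append(v)
    return np.array(history)

# alpha = 1: the estimate is just the last reward, flipping between 0 and 1 forever.
v_alpha1 = value_estimate(lambda n: 1.0)
# constant alpha = 0.3: no point convergence either; persistent fluctuation of order alpha.
v_const = value_estimate(lambda n: 0.3)
# alpha_n = 1/n (the sample average): converges to the true value 0.5.
v_decay = value_estimate(lambda n: 1.0 / n)

print("last 5 estimates, alpha=1   :", v_alpha1[-5:])
print("last 5 estimates, alpha=0.3 :", v_const[-5:])
print("last 5 estimates, alpha=1/n :", v_decay[-5:])
```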
Oscillating or nonconvergent value estimation is not the cause of policy mode collapse.
> The advantage definition itself is correct and non-oscillating… Oscillating or nonconvergent value estimation is not the cause of policy mode collapse.
The advantage is (IIUC) defined with respect to a given policy, and so the advantage can oscillate and then cause mode collapse. I agree that a constant learning rate schedule is problematic, but note that ACTDE converges even with a constant learning rate schedule. So, I would indeed say that oscillating value estimation caused mode collapse in the toy example I gave?
[1] I don’t know what’s a good resource on GLIE, but you can just look up “Greedy in the Limit with Infinite Exploration”.
[2] Amazingly, there’s no Wikipedia entry on UCB??