Well, if we’re going to get historical, PPO is a relatively small variation on Williams’s REINFORCE, a model-free policy-gradient RL algorithm from 1992 (or earlier, if you count conferences etc.).
Oops.
I don’t know how you can say that.
Well, I didn’t say it, TurnTrout did.
Oops.