Thank you! I’ve read up to and including section 4. I previously knew a bit about neural networks but had no experience with RL, and in particular didn’t know how RL actually bridges the gap between a long sequence of actions and a sparse reward (as in: an hour of StarCraft gameplay just to learn you’ve failed or won). Your article helped me realize how it is achieved; IIUC, by:
0. focusing on predicting what the future reward will be rather than on directly maximizing it
1. using a recursive approach to the infinite sum: sum(everything) = e + sum(verything).
2. using a different neural network on the LHS than on the RHS of this equality, so that one of them can be treated as “fixed” and thus take no part in backpropagation (I’ve tried to sketch this in code below). Cool!
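If I’ve understood point 2 correctly, in code it would look something like this (a minimal PyTorch-style sketch with made-up names and sizes, just to check my understanding):

```python
import copy
import torch
import torch.nn as nn

online_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # LHS: updated by backprop
target_net = copy.deepcopy(online_net)                                      # RHS: the "fixed" frozen copy

def td_target(reward, next_state, gamma=0.99):
    # "e + sum(verything)": the immediate reward plus the frozen network's estimate
    # of everything that comes after it; no gradients flow through the frozen copy.
    with torch.no_grad():
        return reward + gamma * target_net(next_state).max(dim=1).values

# ...and every so often the frozen copy gets refreshed:
# target_net.load_state_dict(online_net.state_dict())
```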
Honestly, I am still a bit fuzzy on how exactly we solve the problem that most of the “e”s in e+sum(verything) will be zero. Before reading the article I thought the secret sauce would be some imaginary “self-rewarding” that eventually sums up to the same total reward but in smaller increments over time (like maybe we should be a little happy about each vanishing pixel in Arkanoid, or each produced unit in StarCraft, even if there’s no direct reward for that from the environment).
After reading your explanation I have some vague intuition that even if “e” is zero, e+sum(verything) has a better chance of figuring out what the future rewards will be, because we are one step closer to those rewards, and that somehow this small dent is enough to get things going. I imagine it works somewhat like this: if there are 5 keys I have to press in succession to score one point, then sooner or later I will sample the state after 4 presses and fix my predictor’s output for that state, since the reward is the immediate result of one keypress; then I’ll be ready to learn what to do after 3 presses, because I already know what happens after the 4th keypress, so I can better predict the two-action chain; and so on, learning to predict the consequences of longer and longer combos. Is that right?
Things which could improve the article:
1. Could you please go over each occurrence of the word “reward” in the article and make it clear whether it means “reward at step t” or “sum of rewards from step t onwards”? I found some parts of the explanation really difficult to follow because of this ambiguity.
2. In the pseudo-code box you used the formula “r(j) + Q(s’, a’, q)”. I was expecting “r(j) + gamma*q(s’, a’)” instead. This is because I thought it should match the earlier math formula (the one with red, green, and blue parts), where I interpreted Q(s’, a’, theta^-) to mean something like “the network output for state s’, proposed action a’, and weights set to the old frozen values”, which I believe should be q(s’, a’) in the pseudocode. If I am wrong, please clarify why we don’t need the multiplication by gamma, why Q takes three arguments, and why q is one of them, as none of this is clear to me.
3. Some explanation of how this sparsity of rewards is actually addressed in games that give you zero reward until the very end.
Your understanding is good. What you refer to as “self-rewarding” is called “reward shaping” in reinforcement learning. DQN doesn’t use it, so I didn’t talk about it in this post. DQN also doesn’t do well in games where you can’t stumble upon the first reward through random exploration to get the process started. So, for your Point 3: DQN doesn’t address this, but later RL methods do.
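Just to illustrate the idea of reward shaping (a toy sketch, not something DQN or this post actually uses):

```python
def shaped_reward(env_reward, bricks_broken_this_step):
    # Hypothetical Arkanoid-style shaping: a small bonus for every broken brick,
    # so the agent gets some signal long before the episode's final score arrives.
    return env_reward + 0.1 * bricks_broken_this_step
```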
Montezuma’s Revenge is one such example: the first reward requires traversing several obstacles correctly and is vanishingly unlikely to occur by random chance. If you look up the history of Montezuma’s Revenge in reinforcement learning you’ll find some interesting ways people approached this problem, such as biasing the network towards states it hasn’t seen before (i.e., building “curiosity” into it). The basic DQN described in this post cannot solve Montezuma’s Revenge.
With your “5 keys” example, you’re almost correct. The agent needs to actually perform the entire combo at least once before it can learn to predict it: the target network looks at t+1, but that t+1 comes from the replay buffer, so the transition has to have been experienced at least once. So, sooner or later you’ll sample the transition where the 5th press produces the reward, and then be able to predict that reward in the future. From there it cascades backwards over time exactly as you described.
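Here’s a toy illustration of that backward cascade (tabular one-step TD updates on a made-up 5-key chain rather than a full DQN, but the bootstrapping is the same idea):

```python
import random

N_KEYS, GAMMA, ALPHA = 5, 0.9, 0.5
# State i means "i correct presses so far"; the only reward is +1 for the final press.
V = [0.0] * (N_KEYS + 1)  # V[5] is the terminal state and stays at 0

for _ in range(2000):
    s = random.randint(0, N_KEYS - 1)         # sample a stored transition, replay-buffer style
    reward = 1.0 if s == N_KEYS - 1 else 0.0  # "e" is zero everywhere except the final press
    target = reward + GAMMA * V[s + 1]        # e + sum(verything), bootstrapped from the next state
    V[s] += ALPHA * (target - V[s])

print([round(v, 2) for v in V[:N_KEYS]])
# The estimates fill in from the end of the combo backwards: V[4] picks up the
# reward first, then V[3] learns from V[4], and so on down to the first press.
```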
Regarding Point 1, I’ve gone over and adjusted the terms—“future reward” now refers to the sum of rewards from step t onwards, and I’ve defined that early in Section 2, to make it clear that future reward also includes the reward from timestep t.
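In symbols (assuming the same discount factor gamma as in the update rule), “future reward” at step t now always means

R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k \ge 0} \gamma^k r_{t+k}.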
Regarding Point 2, you’re correct, and I’ve made those changes.
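For reference, the corrected target in the pseudo-code is now roughly of the form

y_j = r_j + \gamma \max_{a'} q(s', a'),

where q is evaluated with the old frozen weights (the \theta^- in the math formula).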
Thanks for the feedback, and I’m glad you got some useful info out of the article!