You could also have a version of REINFORCE that doesn’t make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can’t prove anything interesting about this, but you also can’t prove anything interesting about actor-critic methods that don’t have episode boundaries, I think.
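For concreteness, here is a minimal sketch (my own illustration, not from this exchange) of the variant described above: keep a decaying eligibility trace of score functions, and on every reward nudge all past actions, with weight shrinking the further back the action was taken. It assumes a tabular softmax policy and a toy random-walk environment; all names and constants (`gamma`, `alpha`, etc.) are illustrative.

```python
import numpy as np

n_states, n_actions = 5, 2
theta = np.zeros((n_states, n_actions))   # policy logits, one row per state
trace = np.zeros_like(theta)              # decaying trace of grad log pi terms
gamma, alpha = 0.9, 0.05                  # credit-decay factor, learning rate

rng = np.random.default_rng(0)
state = 0
for t in range(10_000):
    # sample an action from the softmax policy at the current state
    logits = theta[state] - theta[state].max()
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(n_actions, p=probs)

    # score function grad log pi(a|s) for a tabular softmax policy
    grad_log_pi = np.zeros_like(theta)
    grad_log_pi[state] = -probs
    grad_log_pi[state, action] += 1.0

    # decay the credit of older actions, add credit for the newest one
    trace = gamma * trace + grad_log_pi

    # toy dynamics/reward: action 1 in the last state occasionally pays off
    reward = 1.0 if (state == n_states - 1 and action == 1) else 0.0

    # every reward takes a policy-gradient step for all past actions,
    # weighted by how recently each was taken (via the trace)
    theta += alpha * reward * trace

    state = min(state + action, n_states - 1) if rng.random() < 0.9 else 0
```

There are no episode boundaries here: the trace is never reset, so the decay factor alone decides how far back a reward's credit reaches.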
Yeah, you can do this. I expect actor-critic to work better, because your suggestion is essentially a fixed model which says that actions are more relevant to temporally closer rewards (and that this is the only factor to consider).
I’m not sure how to further convey my sense that this is all very interesting. My model is that you’re like “ok sure” but don’t really see why I’m going on about this.
Yeah, I think this is basically right. For the most part though, I’m trying to talk about things where I disagree with some (perceived) empirical claim, as opposed to the overall “but why even think about these things”—I am not surprised when it is hard to convey why things are interesting in an explicit way before the research is done.
Here, I was commenting on the perceived claim of “you need to have two-level algorithms in order to learn at all; a one-level algorithm is qualitatively different and can never succeed”, where my response is “but no, REINFORCE would do okay, though it might be more sample-inefficient”. But it seems like you aren’t claiming that, just claiming that two-level algorithms do quantitatively but not qualitatively better.
Actually, that wasn’t what I was trying to say. But, now that I think about it, I think you’re right.
I was thinking of the discounting variant of REINFORCE as having a fixed, but rather bad, model associating rewards with actions: rewards are tied more strongly to temporally nearby actions. So I was thinking of it as still two-level, just worse than actor-critic.
But although the credit assignment will make mistakes (a predictable punishment the agent can do nothing to avoid will still make the actions leading up to it less likely in the future), those mistakes should average out in the long run: the ‘wrongfully punished’ actions should also be ‘wrongfully rewarded’. So it isn’t really right to say it depends strongly on that assumption.
Instead, it’s better to think of it as a true discounting function. I.e., it’s not an assumption about the structure of consequences; it’s an expression of how much the system cares about distant rewards when taking an action. Under this interpretation, REINFORCE indeed “closes the gradient gap”—solves the credit assignment problem without restrictive modeling assumptions.
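To make the “true discounting” reading concrete (notation mine, not from the exchange): with discount $\gamma$ and policy $\pi_\theta$, the per-reward update described earlier is roughly

$$\Delta\theta_t \;\propto\; r_t \sum_{k \le t} \gamma^{\,t-k}\,\nabla_\theta \log \pi_\theta(a_k \mid s_k),$$

which, summed over time, rearranges to $\sum_k \nabla_\theta \log \pi_\theta(a_k \mid s_k)\, G_k$ with $G_k = \sum_{t \ge k} \gamma^{\,t-k} r_t$, i.e. ordinary REINFORCE applied to $\gamma$-discounted returns. The earlier point about mistakes averaging out is the usual baseline argument: a reward contribution $c$ that an action cannot influence satisfies $\mathbb{E}_{a \sim \pi_\theta}\!\big[c\,\nabla_\theta \log \pi_\theta(a \mid s)\big] = c \sum_a \nabla_\theta \pi_\theta(a \mid s) = 0$, so ‘wrongful’ punishments add variance but no bias.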
Maybe. It might also be argued that REINFORCE depends on some properties of the environment, such as ergodicity. I’m not that familiar with the details.
But anyway, it now seems like a plausible counterexample.