I interpret you as making the claim (across this and your other recent posts): don’t expect policies to get a low loss just because they were selected for getting a low loss, instead think about how SGD steps will shape what they are “trying” to do and use that to reason directly about their generalization behavior.
Yeah… I interpret TurnTrout as saying “look I know it seems straightforward to say that we are optimizing over policies rather than building policies that optimize for reward, but actually this difference is incredibly subtle”. And I think he’s right that this exact point has the kind of subtlety that just keeps biting again and again. I have the sense that this distinction held up evolutionary biology for decades.
Nevertheless, yes, as you say, the question is how to in fact reason from “policies selected according to such-and-such loss” to “any guarantees whatsoever about general behavior of policy”. I wish we could say more about why this part of the problem is so ferociously difficult.