It sounds like there was a misunderstanding somewhere—I’m aware that all of this would be learned; my point is that the learned policy contains an optimizer rather than being an optimizer, which seems like a significant point, and your original definition sounded like you wanted the learned policy to be an optimizer.
It sounds like there was a misunderstanding somewhere—I’m aware that all of this would be learned; my point is that the learned policy contains an optimizer rather than being an optimizer, which seems like a significant point, and your original definition sounded like you wanted the learned policy to be an optimizer.