Sorry if I’ve misunderstood the point of your post, but I’m surprised that Bellman’s optimality equation was nowhere mentioned. From Sutton’s book on the topic I understood that once the policy iteration of vanilla RL has converged to the point where the Bellman optimality equation holds, the agent is maximizing “value”, which I would define in words as something like the expected discounted cumulative reward. And rather than turn off a student new to the topic by giving precise definitions of those terms right away, I can see why he might have contracted that, a bit unfortunately, to “a numerical reward signal”.
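To spell out what I mean (this is just the standard notation from Sutton & Barto, not anything specific to your post): the value of a state under a policy π is the expected discounted cumulative reward,

$$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right],$$

and the Bellman optimality equation says that the optimal value function satisfies

$$v_*(s) = \max_a \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr].$$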
I don’t feel competent to comment on how the picture is complicated in deep RL by the fact that the value function might only be learned approximately. But it doesn’t seem too far-fetched to me that the agent will still end up maximizing a “value”, where maybe the notion of expectation needs to be modified a bit.
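In case it helps to make “learned only approximately” concrete, here is a minimal sketch of what I have in mind, namely TD(0) with a linear function approximator. It is my own toy example (the feature vectors and transitions are made up), not something taken from your post or from any particular deep RL system:

```python
# Toy sketch: TD(0) with a linear value-function approximator.
# The learned v_hat only ever approximates the true value function.
import numpy as np

rng = np.random.default_rng(0)
n_features, gamma, alpha = 8, 0.99, 0.01
w = np.zeros(n_features)  # weights of the approximate value function


def v_hat(phi, w):
    """Approximate value: a linear function of the state features."""
    return phi @ w


def td0_update(w, phi, reward, phi_next, done):
    """One TD(0) step: nudge w toward the bootstrapped target r + gamma * v_hat(s')."""
    target = reward + (0.0 if done else gamma * v_hat(phi_next, w))
    td_error = target - v_hat(phi, w)
    return w + alpha * td_error * phi


# Fake transitions, just to show the update in motion.
for _ in range(1000):
    phi, phi_next = rng.normal(size=n_features), rng.normal(size=n_features)
    reward, done = rng.normal(), rng.random() < 0.05
    w = td0_update(w, phi, reward, phi_next, done)
```

The point of the sketch is only that the agent still acts so as to maximize an estimated expectation of discounted cumulative reward, even though that estimate is approximate.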