This was all addressed in my essay—what you just quoted was Sutton and Barto doing exactly what I described, introducing “extra machinery” in order to get RL to work.
So I already responded to you. The relevant points are these:
1 -- If all that extra machinery becomes complex enough to handle real situations, it starts to become meaningless to insist that the ONLY way to choose which policy to adopt should be a single “reward” signal. Why not make the decision a complex, distributed computation? Why insist that the maximisation of that one number MUST be the way the system operates? After all, a realistic learning mechanism (the extra machinery) will have thousands of components operating in collaboration with one another, and those components will be using dozens or hundreds of internal parameters to control how they work. And yet the final result of all that cognitive apparatus is supposed to be a single decision point where the maximisation of one number is computed, to select between all those plans? (A minimal sketch of that architecture appears after point 2 below.)
If you read what I wrote (and read the background history) you will see that psychologists considered that scenario exhaustively, and that is why they abandoned the whole paradigm. The extra machinery was where the real action was, and the imposition of that final step (deciding policy based on a reward signal) became a joke. Or worse than a joke: nobody, to my knowledge, could actually get such systems to work, because the subtlety and intelligence achieved by the extra machinery would be thrown in the trash by that back-end decision.
2 -- Although Sutton and Barto allow any kind of learning mechanism to be called “RL”, in practice that other stuff has never become particularly sophisticated, EXCEPT in those cases where it became totally dominant and the researchers abandoned the reward signal. In simple terms: yes, but RL stops working when the other stuff becomes clever.
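To make the shape of that objection concrete, here is a minimal sketch in Python. Every name in it (Plan, complex_distributed_machinery, scalar_reward_estimate, choose_policy) is a hypothetical illustration, not code from Sutton and Barto or from any actual system:

    # A hypothetical sketch of the architecture in question: rich, distributed
    # machinery produces and analyses plans, and then the final choice
    # collapses everything into maximising one scalar.

    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Plan:
        name: str
        features: dict  # whatever the rich upstream machinery computed about this plan


    def complex_distributed_machinery(observation) -> List[Plan]:
        # Stand-in for the thousands of collaborating components (planners,
        # models, memories, heuristics), each with its own internal parameters.
        return [
            Plan("plan_a", {"novelty": 0.9, "risk": 0.4, "social_cost": 0.1}),
            Plan("plan_b", {"novelty": 0.2, "risk": 0.1, "social_cost": 0.0}),
        ]


    def scalar_reward_estimate(plan: Plan) -> float:
        # The single number that everything upstream gets funnelled into.
        f = plan.features
        return f["novelty"] - f["risk"] - f["social_cost"]


    def choose_policy(observation) -> Plan:
        plans = complex_distributed_machinery(observation)
        # The step at issue: all the upstream subtlety is reduced to an
        # argmax over one scalar per plan.
        return max(plans, key=scalar_reward_estimate)


    print(choose_policy(observation=None).name)  # -> "plan_a"

However clever complex_distributed_machinery becomes, the only thing that ever reaches the decision is scalar_reward_estimate; that bottleneck is the point of contention.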
Conclusion: you did not read the essay carefully enough. Your point was already covered.