In my book, “reinforcement learning” has very little to do with its behaviorist origins anymore. Rather, I understand it the way it is defined in Sutton & Barto (chap. 1.1):
Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal. [...] Reinforcement learning is defined not by characterizing learning methods, but by characterizing a learning problem. Any method that is well suited to solving that problem, we consider to be a reinforcement learning method. [...]

Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment. This is in contrast with many approaches that consider subproblems without addressing how they might fit into a larger picture. For example, we have mentioned that much of machine learning research is concerned with supervised learning without explicitly specifying how such an ability would finally be useful. Other researchers have developed theories of planning with general goals, but without considering planning's role in real-time decision-making, or the question of where the predictive models necessary for planning would come from. Although these approaches have yielded many useful results, their focus on isolated subproblems is a significant limitation.

Reinforcement learning takes the opposite tack, starting with a complete, interactive, goal-seeking agent. All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments. Moreover, it is usually assumed from the beginning that the agent has to operate despite significant uncertainty about the environment it faces. When reinforcement learning involves planning, it has to address the interplay between planning and real-time action selection, as well as the question of how environmental models are acquired and improved. When reinforcement learning involves supervised learning, it does so for specific reasons that determine which capabilities are critical and which are not. For learning research to make progress, important subproblems have to be isolated and studied, but they should be subproblems that play clear roles in complete, interactive, goal-seeking agents, even if all the details of the complete agent cannot yet be filled in.
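To make that problem-centric framing concrete, here is a minimal Python sketch of the agent-environment loop the excerpt describes. The names (Environment, Agent) and the toy dynamics are illustrative inventions of mine, not anything from the book:

```python
# Minimal sketch of the "RL problem": an agent maps situations (states) to
# actions and tries to maximize cumulative reward while interacting with an
# uncertain environment. Everything here is a toy stand-in.
import random

class Environment:
    """A toy two-state world with noisy rewards (purely illustrative)."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Uncertainty: the reward for a (state, action) pair is noisy.
        reward = random.gauss(1.0 if action == self.state else 0.0, 0.1)
        self.state = random.choice([0, 1])    # the next "situation"
        return self.state, reward

class Agent:
    """Anything that maps situations to actions; learning hook included."""
    def act(self, state):
        return state                          # a hand-coded policy: match the state

    def learn(self, state, action, reward, next_state):
        pass                                  # any learning method could plug in here

env, agent = Environment(), Agent()
state, ret = env.state, 0.0
for _ in range(1000):
    action = agent.act(state)
    next_state, reward = env.step(action)
    agent.learn(state, action, reward, next_state)
    ret += reward
    state = next_state
print(f"cumulative reward over 1000 steps: {ret:.1f}")
```

Nothing in this loop dictates how `agent.learn` or `agent.act` work internally; that is exactly the sense in which the problem, not the mechanism, defines RL.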
Thus your critique seems misplaced—for instance, you say that
The whole point of the RL mechanism is that the intelligent system doesn’t engage in a huge, complex, structured analysis of the situation, when it tries to decide what to do (if it did, the explanation for why the creature did what it did would be in the analysis itself, after all!). Instead, the RL people want you to believe that the RL mechanism did the heavy lifting, and that story is absolutely critical to RL. The rat simply tries a behavior at random—with no understanding of its meaning—and it is only because a reward then arrives, that the rat decides that in the future it will go press the lever again.
… but as is noted in the excerpt above, in modern RL, there’s no single “RL mechanism”—rather any method which successfully solves the reinforcement learning problem is “an RL method”. Nothing requires that method to be “try things totally at random with no understanding of their meaning” (even if that is one RL method which may be suited to some very simple situations).
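To illustrate, here are two toy methods for one and the same reward-maximization problem (a noisy three-armed bandit invented for the example). Both qualify as "RL methods" under the problem-based definition, but only the first is blind random trial; the second maintains value estimates and acts on them:

```python
# Two different methods for the same RL problem. All names are illustrative.
import random

TRUE_MEANS = [0.2, 0.5, 0.8]                      # unknown to the agents
def pull(arm):
    return random.gauss(TRUE_MEANS[arm], 0.1)     # noisy reward signal

def random_method(steps=1000):
    """Pure trial-at-random: the caricature in the quoted critique."""
    return sum(pull(random.randrange(3)) for _ in range(steps))

def estimating_method(steps=1000, eps=0.1):
    """Keeps running estimates of each action's value and mostly exploits
    them, so the choice is informed rather than random."""
    est, counts = [0.0] * 3, [0] * 3
    total = 0.0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(3)             # occasional exploration
        else:
            arm = max(range(3), key=lambda a: est[a])
        r = pull(arm)
        counts[arm] += 1
        est[arm] += (r - est[arm]) / counts[arm]  # incremental mean
        total += r
    return total

print("random method:    ", round(random_method(), 1))
print("estimating method:", round(estimating_method(), 1))
```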
This was all addressed in my essay—what you just quoted was Sutton and Barto doing exactly what I described, introducing “extra machinery” in order to get RL to work.
So I already responded to you. The relevant points are these:
1 -- If all that extra machinery becomes complex enough to handle real situations, it starts to become meaningless to insist that the ONLY way to choose which policy to adopt should be a single "reward" signal. Why not make the decision a complex, distributed computation? Why insist that the maximization of that one number MUST be the way the system operates? After all, a realistic learning mechanism (the extra machinery) will have thousands of components operating in collaboration with one another, and these other mechanisms will be using dozens or hundreds of internal parameters to control how they work. And then the final result of all that cognitive apparatus is supposed to be a decision point where the maximization of a single number is computed, to select between all those plans? (A minimal sketch of that final step appears after point 2.)
If you read what I wrote (and read the background history) you will see that psychologists considered that scenario exhaustively, and that is why they abandoned the whole paradigm. The extra machinery was where the real action was, and the imposition of that final step (deciding policy based on reward signal) became a joke. Or worse than a joke: nobody, to my knowledge, could actually get such systems to work, because the subtlety and intelligence achieved by the extra machinery would be thrown in the trash by that back-end decision.
2 -- Although Sutton and Barto allow any kind of learning mechanism to be called "RL", in practice that other stuff has never become particularly sophisticated, EXCEPT in those cases where it became totally dominant and the researcher abandoned the reward signal. In simple terms: yes, but RL stops working when the other stuff becomes clever.
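To make the shape of this objection concrete, here is a minimal sketch, with every name invented for illustration, of the architecture I am describing: arbitrarily elaborate machinery generates and analyses candidate plans, and then everything is funnelled through the maximization of one number:

```python
# Sketch of the "back-end decision" criticized in point 1 (all names made up).

def generate_candidate_plans(situation):
    # Stand-in for "thousands of components operating in collaboration":
    # in a real system this is where all the structured analysis would live.
    return [f"plan-{i} for {situation}" for i in range(5)]

def predicted_reward(plan):
    # Stand-in for a learned scalar value estimate; here just a dummy score.
    return sum(map(ord, plan)) % 10

def choose_policy(situation):
    plans = generate_candidate_plans(situation)
    # The step at issue: everything above is reduced to picking the plan
    # with the largest single number.
    return max(plans, key=predicted_reward)

print(choose_policy("rat facing a lever"))
```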
Conclusion: you did not read the essay carefully enough. Your point was already covered.