I’m a bit confused about what you’re proposing. AlphaZero has an input (board state) and an output (move). Are you proposing to call this input-output function “a policy”?
If so, sure, we can say that, but I think people would find it confusing. There’s a tree search in between the input and output, and one ingredient of that tree search is the “policy network” (or maybe just “policy head”, I forget). The relation between the “policy network” and the final input-output function is quite indirect, such that it seems odd to use (almost) the same term for both.
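To pin down the two objects being conflated, here is a minimal sketch (Python, with hypothetical type names; not anything from the AlphaZero paper). The “policy head” maps a state to a prior over moves and is consumed *inside* the search, while the overall input-output function maps a state to a single chosen move:

```python
from typing import Callable, Dict, Hashable

# Hypothetical type aliases, just to distinguish the two objects.
State = Hashable
Move = str

# The "policy network"/"policy head": state -> prior distribution over moves.
# In AlphaZero this is one ingredient inside the tree search.
PolicyHead = Callable[[State], Dict[Move, float]]

# The overall input-output function: state -> one chosen move, with a whole
# MCTS between the two. Whether to also call this "a policy" is the question.
MoveChooser = Callable[[State], Move]
```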
In my head, a policy is just a situation-dependent way of acting. Sometimes that way of acting makes use of foresight, sometimes that way of acting is purely reflexive. I mentally file the AlphaZero policy network + tree search combination as a “policy”, one separate from the “reactive policy” defined by just using the policy network without tree search. Looking back at Sutton & Barto, they define “policy” similarly:
> A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus–response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others *it may involve extensive computation such as a search process*. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action.
(emphasis mine) along with this later description of planning in a model-based RL context:
> The word *planning* is used in several different ways in different fields. We use the term to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment.
which seems compatible with thinking of planning algorithms like MCTS as components of an improved policy at runtime (not just in training).
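As a concrete illustration of that reading, here is a runnable sketch in a toy domain (the domain and all names are hypothetical, and the depth-limited lookahead is a cheap stand-in for MCTS, not AlphaZero’s actual algorithm). Both functions expose the same state → action interface, so both fit the Sutton & Barto definition of a policy; one is reflexive, the other plans through a known model:

```python
import math
from typing import Callable, Dict, List, Tuple

# Toy domain (hypothetical): states are integers, the two actions step
# forward by 1 or 2, and the only reward is for landing exactly on 10.
State = int
Action = int
ACTIONS: List[Action] = [1, 2]
GAMMA = 0.9  # discount, so reaching 10 sooner is strictly better

def model(state: State, action: Action) -> Tuple[State, float]:
    """Known transition model (the 'model as input' of the planning quote)."""
    nxt = state + action
    return nxt, (1.0 if nxt == 10 else 0.0)

def policy_net(state: State) -> Dict[Action, float]:
    """Stand-in for a learned network prior: myopically prefers small steps.
    (Ignores the state, for simplicity.)"""
    logits = {1: 1.0, 2: 0.0}
    z = sum(math.exp(v) for v in logits.values())
    return {a: math.exp(v) / z for a, v in logits.items()}

# Both callables below map a state to an action, so both count as
# "policies" in the Sutton & Barto sense quoted above.
Policy = Callable[[State], Action]

def reactive_policy(state: State) -> Action:
    """Reflexive policy: act straight from the network's prior."""
    prior = policy_net(state)
    return max(prior, key=prior.get)

def search_policy(state: State, depth: int = 3) -> Action:
    """Foresightful policy: exhaustive depth-limited lookahead through the
    model (a stand-in for MCTS), exposing the same state -> action
    interface as reactive_policy."""
    def q(s: State, a: Action, d: int) -> float:
        nxt, r = model(s, a)
        if d == 0 or nxt >= 10:
            return r
        return r + GAMMA * max(q(nxt, b, d - 1) for b in ACTIONS)
    return max(ACTIONS, key=lambda a: q(state, a, depth))

if __name__ == "__main__":
    policies: Dict[str, Policy] = {"reactive": reactive_policy,
                                   "search": search_policy}
    for name, pi in policies.items():
        print(name, {s: pi(s) for s in (7, 8, 9)})
```

At state 8 the two disagree (the reactive policy myopically steps +1, while the lookahead jumps straight to 10), but both are the same kind of object: a mapping from states to actions.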
That being said, looking at the AlphaZero paper, a quick search did not turn up usages of the term “policy” in this way. So maybe this usage is less widespread than I had assumed.
Interesting, thanks!