I think I see where you’re coming from, but I have mixed feelings; I’m going back and forth, though for my part I’m leaning towards sticking with textbook terminology.
Once we fix the policy network and sampling procedure, we get a mapping from observations…to probability distributions over outputs…. This mapping π is the policy.…
Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present). I think it’s appropriate to call that kind of computation “agentic.” But that designation is only appropriate after further information is discovered (e.g. how the policy in fact works).
I’m a bit confused about what you’re proposing. AlphaZero has an input (board state) and an output (move). Are you proposing to call this input-output function “a policy”?
If so, sure we can say that, but I think people would find it confusing—because there’s a tree search in between the input and output, and one ingredient of the tree search is the “policy network” (or maybe just “policy head”, I forget), but here the relation between the “policy network” and the final input-output function is very indirect, such that it seems odd to use (almost) the same term for them.
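To spell out the structure I have in mind, here’s a toy Python sketch (the names and the trivial “search” are invented for illustration; this is not how AlphaZero is actually implemented):

```python
import random

# Toy stand-ins so the sketch runs; a real system has a game engine and trained nets here.
def legal_moves(board):
    return [i for i, square in enumerate(board) if square is None]

def apply_move(board, move):
    new_board = list(board)
    new_board[move] = "x"
    return new_board

def policy_network(board):
    """The 'policy network' (or 'policy head'): board -> prior over legal moves."""
    moves = legal_moves(board)
    return {m: 1.0 / len(moves) for m in moves}  # stub: uniform priors

def value_network(board):
    """The value head: board -> estimated win probability."""
    return random.random()  # stub

def tree_search(board, priors):
    """Crude stand-in for MCTS (the real thing is far more involved). The point
    is only that the policy network's output is one ingredient *inside* this
    computation, not the final answer."""
    scores = {m: priors[m] + value_network(apply_move(board, m)) for m in priors}
    return max(scores, key=scores.get)

def board_to_move(board):
    """The overall input-output function: board state in, move out."""
    return tree_search(board, policy_network(board))
```

The question is whether we want to reserve the word “policy” for policy_network, for board_to_move, or for both.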
Don’t assume the conclusion by calling a policy an “agent”
The word “agent” invokes a bundle of intuitions / associations, and you think many of those are misleading in general. So then one approach is to ask everyone to avoid the word “agent” in cases where those intuitions don’t apply, and the other is to ask everyone to constantly remind each other that the “agents” produced by RL don’t necessarily have thus-and-such properties.
Neither option is great; this is obviously a judgment call.
For my part, I think that if I say:
“An RL agent isn’t necessarily planning ahead towards goals, in many cases it’s better to think of it as a bundle of situation-dependent reactions…”
…then that strikes me as a normal kind of thing to say as part of a healthy & productive conversation.
So maybe I see pushing-back-on-the-intuitions-while-keeping-the-word as a more viable approach than you do.
(And separately, I see editing widely-used terminology as a very, very big cost, probably more so than you do.)
Ditto for “reward”.
“Reinforcement-maximizing policy”
This sounds slightly weird to me, because I intuitively associate “reinforcement” with “updates”, whereas the policy in question is a fixed point that stops getting updated altogether.
I…don’t currently think RL is much more dangerous than other ways of computing weight updates
You mention that this is off-topic so maybe you don’t want to discuss it, but I probably disagree with that—with the caveat that it’s very difficult to do an other-things-equal comparison. (I.e., we’re presumably interested in RL-safety-versus-SSL-safety holding capabilities fixed, but switching from RL to SSL does have an effect on capabilities.)
Then later you say “only using unsupervised pretraining doesn’t mean you’re safe” which is a much weaker statement, and I agree with it.
I’m a bit confused about what you’re proposing. AlphaZero has an input (board state) and an output (move). Are you proposing to call this input-output function “a policy”?
If so, sure we can say that, but I think people would find it confusing—because there’s a tree search in between the input and output, and one ingredient of the tree search is the “policy network” (or maybe just “policy head”, I forget), but here the relation between the “policy network” and the final input-output function is very indirect, such that it seems odd to use (almost) the same term for them.
In my head, a policy is just a situation-dependent way of acting. Sometimes that way of acting makes use of foresight; sometimes it’s purely reflexive. I mentally file the AlphaZero policy network + tree search combination as a “policy”, one separate from the “reactive policy” defined by just using the policy network without tree search. Looking back at Sutton & Barto, I see they define “policy” similarly:
A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus–response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action.
(emphasis mine) along with this later description of planning in a model-based RL context:
The word planning is used in several different ways in different fields. We use the term to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment
which seems compatible with thinking of planning algorithms like MCTS as components of an improved policy at runtime (not just in training).
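To make that concrete, here’s a toy sketch (stub functions and invented names, not taken from any actual AlphaZero-style codebase) in which both state-to-action mappings count as “policies” under the Sutton & Barto definition:

```python
import random

ACTIONS = ["left", "right", "stay"]  # toy action set

def network_action_probs(state):
    """Stub for a trained policy network: state -> action probabilities."""
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

def simulate_rollout(state, action):
    """Stub world model: pretend to roll out from (state, action) and report a win/loss."""
    return random.random() < 0.5

def reactive_policy(state):
    """'Policy' #1: sample directly from the network's output distribution."""
    probs = network_action_probs(state)
    return random.choices(list(probs), weights=list(probs.values()))[0]

def search_policy(state, num_rollouts=100):
    """'Policy' #2: a planning step (a crude stand-in for MCTS) that uses the
    network's priors plus model rollouts, then acts on the improved estimates."""
    probs = network_action_probs(state)
    wins = {a: 0 for a in probs}
    for _ in range(num_rollouts):
        a = random.choices(list(probs), weights=list(probs.values()))[0]
        if simulate_rollout(state, a):
            wins[a] += 1
    return max(wins, key=wins.get)
```

The second one involves “extensive computation such as a search process”, but it is still just a mapping from states to actions.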
That being said, a quick search of the AlphaZero paper did not turn up usages of the term “policy” in this way, so maybe this usage is less widespread than I had assumed.
So then one approach is to ask everyone to avoid the word “agent” in cases where those intuitions don’t apply, and the other is to ask everyone to constantly remind each other that the “agents” produced by RL don’t necessarily have thus-and-such properties.
I think there’s a way better third alternative: asking each reader to unilaterally switch to “policy.” No coordination, no constant reminders, no communication difficulties (in my experience). I therefore don’t see a case for using “agent” in the mentioned cases.
I added to the post:
Don’t wait for everyone to coordinate on saying “policy.” You can switch to “policy” right now and thereby improve your private thoughts about alignment, whether or not anyone else gets on board. I’ve enjoyed these benefits for a month. The switch didn’t cause communication difficulties.
Interesting, thanks!