I don’t understand why “reinforcement” is better than “reward”? They both invoke the same image to me.
If you reward someone for a task, they might or might not end up reliably wanting to do the task. Same if you “reinforce” them to do that task. “Reinforce” is more abstract, which seems generally worse for communication, so I would mildly encourage people to use “reward function”, but mostly expect other context cues to determine which one is better and don’t have a strong general take.
My understanding of Alex’s point is that the word “reward” invokes a mental image involving model-based planning—”ooh, there’s a reward, what can I do right now to get it?”. And the word “reinforcement” invokes a mental image involving change (i.e. weight updates)—when you reinforce a bridge, you’re permanently changing something about the structure of the bridge, such that the bridge will be (hopefully) better in the future than it was in the past.
So if you want to reason about policy-gradient-based RL algorithms (for example), that’s a (pro tanto) reason to use the term “reinforcement”. (OTOH, if you want to reason about RL-that-mostly-involves-model-based-planning, maybe that’s a reason not to!)
For my own writing, I went back and forth a bit, but wound up deciding to stick with textbook terminology (“reward function” etc.), for various reasons: all the usual reasons that textbook terminology is generally good for communication, plus the irreconcilable jargon clash over what the specific term “negative reinforcement” means (cf. the behaviorist literature). But I try to stay aware of situations where people’s intuitions around the word “reward” might be leading them astray in context, so that I can explicitly call it out and try to correct it.
Yeah, not being able to say “negative reward”/”punishment” when you use “reinforcement” seems very costly. I’ve run into that problem a bunch.
And yeah, that makes sense. I get the “reward implies more model-based thinking” part. I kind of like that distinction, so I’m tentatively in favor of using “reward” for more model-based stuff and “reinforcement” for more policy-gradient-based stuff, if other considerations don’t outweigh that.
I think it makes sense to have a specific word for the thing where you do $w_{\text{new}} = w + r \cdot \nabla_w \log P(o \mid w)$ after the network with weights $w$ has given an output $o$ (or variants thereof, e.g. DPO). TurnTrout seems basically correct in saying that it’s common for rationalists to mistakenly think the network will be consequentialistically aiming to get a lot of these updates, even though it really won’t.
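For concreteness, here’s a minimal sketch of that update (my own illustration of the vanilla score-function / REINFORCE form, with made-up values, not anyone’s actual training code):

```python
# Minimal sketch of w_new = w + r * grad_w log P(o | w):
# sample an output o from the network, then nudge the weights in the
# direction that makes o more likely, scaled by the reward r.
import torch

torch.manual_seed(0)
w = torch.zeros(4, requires_grad=True)     # weights = logits over 4 possible outputs

probs = torch.softmax(w, dim=0)
o = torch.multinomial(probs, 1).item()     # the network "gives an output o"
r = 1.0                                    # reward assigned to that output (made up)

log_p = torch.log_softmax(w, dim=0)[o]     # log P(o | w)
log_p.backward()                           # grad_w log P(o | w)

with torch.no_grad():
    w += r * w.grad                        # the reinforcement step itself
    w.grad.zero_()
```

The point of spelling it out: the reward only ever shows up as a scalar multiplying a weight update; nothing in the computation is itself “trying” to get more of these updates.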
On the other hand I think TurnTrout lacks a story for what happens with stuff like DreamerV3.
As far as I understand, “reward is not the optimization target” is about model-free RL, while DreamerV3 is model-based.
Yep, which is basically my point. I can’t think of any case where I’ve seen him discuss the distinction.
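To make that distinction concrete, here’s a toy contrast (my own cartoon, not DreamerV3’s actual algorithm or TurnTrout’s framing): in the model-free case reward only ever appears as a coefficient on a weight update, whereas in the model-based case the agent’s own computation explicitly selects whatever its learned model predicts will get the most reward.

```python
# Toy 3-armed bandit illustrating the two ways reward can enter the computation.
# All numbers are made up; this is a cartoon, not anyone's real agent.
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([0.1, 0.5, 0.9])        # hidden expected reward per action

# Model-free (REINFORCE-style): reward is just a multiplier on the update.
logits = np.zeros(3)
for _ in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(3, p=probs)
    r = true_reward[a] + 0.1 * rng.standard_normal()
    grad_log_p = -probs
    grad_log_p[a] += 1.0                       # d log pi(a) / d logits
    logits += 0.1 * r * grad_log_p             # nothing here "aims at" reward

# Model-based (cartoon): learn a reward model, then explicitly pick the action
# whose *predicted* reward is highest; here reward is an optimization target
# inside the agent's own computation.
reward_model = np.zeros(3)
counts = np.zeros(3)
for _ in range(500):
    a = int(rng.integers(3)) if rng.random() < 0.1 else int(np.argmax(reward_model))
    r = true_reward[a] + 0.1 * rng.standard_normal()
    counts[a] += 1
    reward_model[a] += (r - reward_model[a]) / counts[a]   # running-mean estimate

print("model-free action probs:", np.round(np.exp(logits) / np.exp(logits).sum(), 2))
print("model-based greedy pick:", int(np.argmax(reward_model)))
```

Obviously DreamerV3 does something much richer (a learned latent world model, with the policy trained in imagination), but the cartoon is enough to see why “reward is not the optimization target” reads differently for the two families.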
From the third paragraph of “Reward is not the optimization target”:
I see. I think maybe I read it when it came out so I didn’t see the update.
Regarding the “Not worth getting into?” (“I’m guessing it’s probably not worth the time to resolve this?”) react:
I’d guess it’s worth getting into because this disagreement is a symptom of the overall question I have about your approach/view.
Though on the other hand, maybe it’s not worth getting into, because once I publish a description of this you might basically go “yeah, that seems like a reasonable resolution, let’s go with that”.