I perceive you as saying “These statements can make sense.” If so, my point isn’t that they can’t be viewed as correct in some sense, or that no one sane could possibly emit such statements. The point is that these quotes are indicative of misunderstanding the points of this essay: if someone says something like the quoted points, that’s unfavorable evidence on this question.
This describes some possible goals, and I don’t see why you think the goals listed are impossible (and I don’t think they are).
I wasn’t implying they’re impossible, I was implying that this is somewhat misguided. Animals learn to achieve goals like “optimizing… the expected sum of future rewards”? That’s exactly what I’m arguing against as improbable.
I’m not saying “These statements can make sense”, I’m saying they do make sense and are correct under their most plain reading.
Re: a possible goal of animals being to optimize the expected sum of future rewards, in the cited paper “rewards” appears to refer to stuff like eating tasty food or mating, where it’s assumed the animal can trade those off against each other consistently:
Decision-making environments are characterized by a few key concepts: a state space..., a set of actions..., and affectively important outcomes (finding cheese, obtaining water, and winning). Actions can move the decision-maker from one state to another (i.e. induce state transitions) and they can produce outcomes. The outcomes are assumed to have numerical (positive or negative) utilities, which can change according to the motivational state of the decision-maker (e.g. food is less valuable to a satiated animal) or direct experimental manipulation (e.g. poisoning)...
In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards[.]
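For concreteness, the “expected sum of future rewards” here is presumably the standard reinforcement-learning return (the notation below is standard RL usage, assumed rather than quoted from the paper): each outcome at time $t$ carries a numerical reward $r_t$, and the goal is to choose a policy $\pi$ that maximizes the expected discounted sum

$$\mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}\right], \qquad \gamma \in [0,1],$$

where $\gamma$ is a discount factor and the bracketed sum is the “return” over the rest of the trajectory.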
It seems totally plausible to me that an animal could be motivated to optimize the expected sum of future rewards in this sense, given that ‘reward’ is basically defined as “things they value”. It seems like the way this would be false would be if animals’ rewards are super unstable, or if the animal doesn’t coherently trade off the things it values. This could happen, but I don’t see why I should see it as overwhelmingly likely.
[EDIT: in other words, the reason the paper conflates ‘rewards’ with ‘optimization target’ is that that’s how they’re defining rewards]
I’m not saying “These statements can make sense”, I’m saying they do make sense and are correct under their most plain reading.
Yup, strong disagree with that.
“rewards” appears to refer to stuff like eating tasty food or mating, where it’s assumed the animal can trade those off against each other consistently:
If that were true, that would definitely be a good counterpoint and mean I misread it. If so, I’d retract my original complaint with that passage. But I’m not convinced that it’s true. The previous paragraph just describes finding cheese as an “affectively important outcome.” Then, later, “outcomes are assumed to have numerical… utilities.” So they’re talking about utility now, OK. But then they talk about rewards. Is this utility? It’s not outcomes (like finding cheese), because you can’t take the expected sum of future finding-cheeses—type error!
When I ctrl+F ‘rewards’ and scroll through, it sure seems like they’re talking about dopamine or RPE or that-which-gets-discounted-and-summed-to-produce-the-return, which lines up with my interpretation.
dopamine or RPE or that-which-gets-discounted-and-summed-to-produce-the-return
Those are three pretty different things—the first is a chemical, the second I guess stands for ‘reward prediction error’, and the third is a mathematical quantity! Like, you also can’t talk about the expected sum of dopamine, because dopamine is a chemical, not a number!
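For reference, “RPE” in this literature standardly refers to the temporal-difference reward prediction error, which is itself a number computed from rewards and value estimates (standard notation, not taken from the paper):

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),$$

i.e. how much better or worse things went than the current value estimate predicted. The dopamine-RL hypothesis is that phasic dopamine activity encodes this quantity, so it’s related to, but distinct from, both reward and return.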
Here’s how I interpret the paper: stuff in the world is associated with ‘rewards’, which are real numbers that represent how good the stuff is. The ‘return’ of some period of time is then the discounted sum of rewards. Rewards represent ‘utilities’ of individual bits of time, but the return function is the actual utility function over trajectories. ‘Predictions of reward’ means predictions of stuff, like bits of cheese, that is associated with reward. I do think the authors equivocate a bit between the numbers and the things that the numbers represent (which IMO is typical for non-mathematicians; see also how physicists constantly conflate quantities like velocity with the functions that take other physical quantities and return the velocity of something), but AFAICT my interpretation accounts for the uses of ‘reward’ in that paper (and in the intro). That said, there are a bunch of them, and as a fallible human I’m probably not good at finding the uses that undermine my theory, so if you have a quote or two in mind that makes more sense under the interpretation that ‘reward’ refers to some function of a brain state rather than some function of cheese consumption or whatever, I’d appreciate you pointing them out to me.
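A minimal sketch of that reading, with all names and numbers hypothetical and chosen only to illustrate the types involved: outcomes are things in the world, a reward function maps each outcome to a number, and the return is a discounted sum of those numbers over a trajectory, so you never sum cheeses, only the numbers attached to them.

```python
from typing import List

GAMMA = 0.9  # hypothetical discount factor

Outcome = str  # stand-in for 'stuff in the world': cheese, water, a shock, ...

def reward(outcome: Outcome, satiated: bool = False) -> float:
    """Numerical utility of a single outcome; it can shift with motivational
    state (e.g. food is worth less to a satiated animal)."""
    base = {"cheese": 1.0, "water": 0.5, "shock": -2.0, "nothing": 0.0}[outcome]
    if satiated and outcome in ("cheese", "water"):
        base *= 0.2
    return base

def trajectory_return(trajectory: List[Outcome]) -> float:
    """Discounted sum of rewards: the utility function over trajectories."""
    return sum(GAMMA ** t * reward(o) for t, o in enumerate(trajectory))

# You can't sum finding-cheeses, but you can sum the numbers a reward
# function assigns to them:
print(trajectory_return(["nothing", "cheese", "shock", "cheese"]))  # ≈ 0.009
```

On this reading, “the animal optimizes the expected sum of future rewards” is a claim about `trajectory_return`, not about any particular brain chemical.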