But in experiments, they’re not synchronized; the former happens faster than the latter.
This has the effect of incentivizing learning, right? (A system that you don’t yet understand is, in total, more rewarding than an equally yummy system that you do understand.) So it reminds me of exploration in bandit algorithms, which makes sense given the connection to motivation.
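To make the bandit analogy concrete, here is a minimal sketch (purely illustrative; the function ucb_score and its parameters mean_reward, pulls, total_pulls, c are made-up names, not anything from the discussion) of a UCB1-style score, in which an option that has been sampled less carries a larger uncertainty bonus, so an “equally yummy” but less-understood option comes out ahead:

```python
import math

def ucb_score(mean_reward, pulls, total_pulls, c=1.0):
    """UCB1-style score: empirical mean plus an uncertainty bonus
    that shrinks as an arm becomes better understood (more pulls)."""
    return mean_reward + c * math.sqrt(math.log(total_pulls) / pulls)

# Two arms with the same empirical mean ("equally yummy"),
# but one has been sampled far less (less well understood).
total = 110
well_known   = ucb_score(mean_reward=0.5, pulls=100, total_pulls=total)
poorly_known = ucb_score(mean_reward=0.5, pulls=10,  total_pulls=total)

print(well_known, poorly_known)  # the poorly-known arm scores higher
```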
Hmm, I guess I mostly disagree because:
- I see this as sorta an unavoidable aspect of how the system works, so it doesn’t really need an explanation;
- You’re jumping to “the system will maximize the sum of future rewards”, but I think RL in the brain is based on “maximize rewards for this step right now” (…and by the way, “rewards for this step right now” implicitly involves an approximate assessment of future prospects; a toy contrast is sketched just after this list). See my comment “Humans are absolute rubbish at calculating a time-integral of reward”.
I’m all for exploration, value-of-information, curiosity, etc., just not involving this particular mechanism.
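To pin down the distinction at issue here, a toy contrast (the names td_update, myopic_update, appraise, V, alpha, gamma are purely illustrative assumptions, not anyone’s actual model): in the standard RL framing the update target is the discounted sum of future rewards, whereas a “this step right now” rule targets only the current reward signal, which then has to carry any sensitivity to future prospects itself.

```python
def td_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
    """Standard RL framing: the target is r + gamma * V(s'), i.e. an
    estimate of the (discounted) sum of future rewards."""
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])

def myopic_update(V, s, r_now, alpha=0.1):
    """'Maximize reward for this step right now': the target is just this
    step's reward signal, with no explicit sum over the future. Any
    sensitivity to future prospects has to already be baked into r_now."""
    V[s] += alpha * (r_now - V[s])

def appraise(immediate_payoff, estimated_future_prospects, w=0.5):
    """A crude stand-in for how 'reward right now' could implicitly fold
    in an approximate assessment of future prospects."""
    return immediate_payoff + w * estimated_future_prospects

# Usage sketch:
V = {"s0": 0.0, "s1": 0.0}
td_update(V, "s0", s_next="s1", r=1.0)
myopic_update(V, "s1", r_now=appraise(0.2, estimated_future_prospects=1.0))
```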
I guess my sense is that most biological systems are going to be ‘package deals’ rather than ‘cleanly separable’ wherever possible: if you already have a system that’s doing learning, and you can tweak that system to get some of the benefits of a VoI framework (without actually calculating VoI), I expect biology to do that.
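One hypothetical illustration of that “package deal” point: bolt a surprise-based bonus onto whatever reward signal the learner already uses. The sketch below assumes this framing (reward_with_curiosity, beta, prediction_error are made-up names) and is not a claim about what the brain actually does; it just shows how a small tweak can buy exploration-like behavior without ever computing VoI explicitly.

```python
def reward_with_curiosity(raw_reward, prediction_error, beta=0.1):
    """Add a bonus proportional to how surprising the outcome was.
    This nudges an existing learner toward less-understood options
    without ever computing value-of-information explicitly."""
    return raw_reward + beta * abs(prediction_error)

# Two outcomes with the same raw payoff, one more surprising:
print(reward_with_curiosity(1.0, prediction_error=0.0))  # 1.0
print(reward_with_curiosity(1.0, prediction_error=2.0))  # 1.2
```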
I agree about the general principle, even if I don’t think this particular thing is an example because of the “not maximizing sum of future rewards” thing.