Hmm, I guess I mostly disagree because:
1. I see this as sorta an unavoidable aspect of how the system works, so it doesn't really need an explanation;
2. You're jumping to "the system will maximize the sum of future rewards", but I think RL in the brain is based on "maximize rewards for this step right now" (…and by the way, "rewards for this step right now" implicitly involves an approximate assessment of future prospects). See my comment "Humans are absolute rubbish at calculating a time-integral of reward", and the toy sketch after this list.
3. I'm all for exploration, value-of-information, curiosity, etc., just not involving this particular mechanism.
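To make point 2 concrete, here's a toy sketch of the distinction I have in mind (my own illustration, nothing from the original post): `V_hat` is a hypothetical cached estimate standing in for "approximate assessment of future prospects". The learner only ever maximizes and updates a one-step quantity, but that quantity ends up tracking the time-integral anyway, without anything computing the integral explicitly.

```python
# Toy three-state chain: 0 -> 1 -> 2, with reward only on the final transition.
REWARDS = {(0, 1): 0.0, (1, 2): 1.0}
GAMMA = 0.9
ALPHA = 0.1

def summed_future_reward(state):
    """The 'sum of future rewards' picture: explicitly time-integrate reward
    over the whole remaining trajectory."""
    total, discount = 0.0, 1.0
    while state < 2:
        total += discount * REWARDS[(state, state + 1)]
        discount *= GAMMA
        state += 1
    return total

# The other picture: the learner only ever looks at "reward for this step right
# now", but that one-step signal folds in a cached guess about future prospects.
V_hat = {0: 0.0, 1: 0.0, 2: 0.0}  # hypothetical cached "future prospects" estimate

def one_step_signal(state):
    """Immediate reward plus the cached estimate of what comes next; no
    explicit integral over the future anywhere."""
    return REWARDS[(state, state + 1)] + GAMMA * V_hat[state + 1]

def td_update(state):
    """Learning also only touches one step at a time (TD-style), yet V_hat
    gradually absorbs information about the rest of the trajectory."""
    V_hat[state] += ALPHA * (one_step_signal(state) - V_hat[state])

for _ in range(200):
    for s in (1, 0):
        td_update(s)

print(summed_future_reward(0))  # 0.9: the explicit time-integral
print(one_step_signal(0))       # ~0.9: the one-step signal ends up matching it
```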
I guess my sense is that most biological systems are going to be 'package deals' rather than 'cleanly separable' wherever possible: if you already have a system that's doing learning, and you can tweak that system to capture some of the benefits of a VoI framework (without actually calculating VoI), I expect biology to do that.
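To gesture at what I mean by that kind of tweak (a toy sketch, not a claim about what the brain actually does): a count-based novelty bonus bolted onto an ordinary reward-learning loop buys some of the exploration benefit, even though VoI is never represented anywhere. All the specific numbers and names below are made up for illustration.

```python
import math
import random

# Hypothetical 5-armed bandit; true mean payoffs unknown to the learner.
TRUE_MEANS = [0.2, 0.5, 0.3, 0.8, 0.4]
random.seed(0)

n_arms = len(TRUE_MEANS)
counts = [0] * n_arms        # how often each arm has been tried
value_est = [0.0] * n_arms   # ordinary running-average reward estimate
BONUS_SCALE = 1.0

def exploration_bonus(arm, t):
    """The 'package deal' tweak: rarely-tried options look a bit more
    rewarding. This buys information-seeking behavior without ever
    computing value-of-information."""
    return BONUS_SCALE * math.sqrt(math.log(t + 1) / (counts[arm] + 1))

def choose_arm(t):
    # Same greedy reward-maximizing machinery as before, just fed a
    # slightly tweaked signal (learned value + cheap bonus).
    return max(range(n_arms), key=lambda a: value_est[a] + exploration_bonus(a, t))

def pull(arm):
    # Noisy reward around the arm's true mean.
    return TRUE_MEANS[arm] + random.gauss(0.0, 0.1)

for t in range(1, 2001):
    arm = choose_arm(t)
    r = pull(arm)
    counts[arm] += 1
    value_est[arm] += (r - value_est[arm]) / counts[arm]  # running average

print(counts)  # most pulls end up on the best arm (index 3, true mean 0.8)
```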
I agree about the general principle, even if I don’t think this particular thing is an example because of the “not maximizing sum of future rewards” thing.