But in experiments, they’re not synchronized; the former happens faster than the latter.
This has the effect of incentivizing learning, right? (A system that you don’t yet understand is, in total, more rewarding than an equally yummy system that you do understand.) So it reminds me of exploration in bandit algorithms, which makes sense given the connection to motivation.
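To make the bandit analogy concrete, here is a minimal sketch (purely illustrative; the function ucb_score and its parameters mean_reward, pulls, total_pulls, c are made-up names, not anything from the discussion) of a UCB1-style score, in which an option that has been sampled less carries a larger uncertainty bonus, so an “equally yummy” but less-understood option comes out ahead:

```python
import math

def ucb_score(mean_reward, pulls, total_pulls, c=1.0):
    """UCB1-style score: empirical mean plus an uncertainty bonus
    that shrinks as an arm becomes better understood (more pulls)."""
    return mean_reward + c * math.sqrt(math.log(total_pulls) / pulls)

# Two arms with the same empirical mean ("equally yummy"),
# but one has been sampled far less (less well understood).
total = 110
well_known   = ucb_score(mean_reward=0.5, pulls=100, total_pulls=total)
poorly_known = ucb_score(mean_reward=0.5, pulls=10,  total_pulls=total)

print(well_known, poorly_known)  # the poorly-known arm scores higher
```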
Hmm, I guess I mostly disagree because:
- I see this as sorta an unavoidable aspect of how the system works, so it doesn’t really need an explanation;
- You’re jumping to “the system will maximize the sum of future rewards”, but I think RL in the brain is based on “maximize rewards for this step right now” (…and by the way, “rewards for this step right now” implicitly involves an approximate assessment of future prospects; a toy contrast is sketched just after this list). See my comment “Humans are absolute rubbish at calculating a time-integral of reward”.
I’m all for exploration, value-of-information, curiosity, etc., just not involving this particular mechanism.
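To pin down the distinction at issue here, a toy contrast (the names td_update, myopic_update, appraise, V, alpha, gamma are purely illustrative assumptions, not anyone’s actual model): in the standard RL framing the update target is the discounted sum of future rewards, whereas a “this step right now” rule targets only the current reward signal, which then has to carry any sensitivity to future prospects itself.

```python
def td_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
    """Standard RL framing: the target is r + gamma * V(s'), i.e. an
    estimate of the (discounted) sum of future rewards."""
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])

def myopic_update(V, s, r_now, alpha=0.1):
    """'Maximize reward for this step right now': the target is just this
    step's reward signal, with no explicit sum over the future. Any
    sensitivity to future prospects has to already be baked into r_now."""
    V[s] += alpha * (r_now - V[s])

def appraise(immediate_payoff, estimated_future_prospects, w=0.5):
    """A crude stand-in for how 'reward right now' could implicitly fold
    in an approximate assessment of future prospects."""
    return immediate_payoff + w * estimated_future_prospects

# Usage sketch:
V = {"s0": 0.0, "s1": 0.0}
td_update(V, "s0", s_next="s1", r=1.0)
myopic_update(V, "s1", r_now=appraise(0.2, estimated_future_prospects=1.0))
```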
I guess my sense is that most biological systems are going to be ‘package deals’ rather than ‘cleanly separable’ wherever possible: if you already have a system that’s doing learning, and you can tweak that system to get some of the benefits of a VoI framework (without actually calculating VoI), I expect biology to do that.
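One hypothetical illustration of that “package deal” point: bolt a surprise-based bonus onto whatever reward signal the learner already uses. The sketch below assumes this framing (reward_with_curiosity, beta, prediction_error are made-up names) and is not a claim about what the brain actually does; it just shows how a small tweak can buy exploration-like behavior without ever computing VoI explicitly.

```python
def reward_with_curiosity(raw_reward, prediction_error, beta=0.1):
    """Add a bonus proportional to how surprising the outcome was.
    This nudges an existing learner toward less-understood options
    without ever computing value-of-information explicitly."""
    return raw_reward + beta * abs(prediction_error)

# Two outcomes with the same raw payoff, one more surprising:
print(reward_with_curiosity(1.0, prediction_error=0.0))  # 1.0
print(reward_with_curiosity(1.0, prediction_error=2.0))  # 1.2
```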
I agree about the general principle, even if I don’t think this particular thing is an example because of the “not maximizing sum of future rewards” thing.