Seems false in RL, for basically the reason you said (“it’s not clear how to update a model towards performing the task if it intentionally tries to avoid showing us any task-performing behavior”). In other words, if we’re doing on-policy learning, and if the policy never gets anywhere close to a reward>0 zone, then the reward>0 zone isn’t doing anything to shape the policy. (In a human analogy, I can easily avoid getting addicted to nicotine by not exposing myself to nicotine in the first place.)
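To make that concrete, here's a minimal sketch (my own illustration, not anything from the original post) using a REINFORCE-style on-policy update in PyTorch: if every sampled rollout earns reward 0, the policy-gradient estimate is exactly zero, so nothing is pulling the policy toward the reward>0 zone.

```python
import torch

# Toy on-policy setup (hypothetical, just for illustration):
# a tiny policy network and a batch of rollouts sampled from it.
torch.manual_seed(0)
policy = torch.nn.Linear(4, 2)                 # 4-dim state -> 2 action logits
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

states = torch.randn(8, 4)                     # states visited by the current policy
actions = torch.randint(0, 2, (8,))            # actions the policy actually took
rewards = torch.zeros(8)                       # policy never reached the reward>0 zone

# REINFORCE (no baseline): loss = -E[ log pi(a|s) * R ]
log_probs = torch.log_softmax(policy(states), dim=-1)
chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
loss = -(chosen_log_probs * rewards).mean()
loss.backward()

# Every reward is 0, so the loss is 0 and every gradient is exactly 0:
print(loss.item())                                            # 0.0
print(all((p.grad == 0).all() for p in policy.parameters()))  # True
```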
I think this might be a place where people-thinking-about-gradient-descent have justifiably different intuitions from people-thinking-about-RL.
(The RL problem might be avoidable if we know how to do the task and can turn that knowledge into effective reward-shaping. Also, for a situationally-aware RL model with a wireheading-adjacent desire to get reward per se, we can get it to do arbitrary things by simply telling it what the reward function is.)
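For the reward-shaping escape hatch, here's a toy sketch of potential-based shaping (Ng et al. 1999); the shaped_reward helper and the specific potential values are hypothetical, just to show the mechanism: if we can encode our knowledge of the task as a "progress" potential, the policy gets a nonzero gradient even in regions where the task reward is still 0.

```python
# Potential-based reward shaping: r'(s, a, s') = r(s, a, s') + gamma * phi(s') - phi(s),
# where phi is a hand-written potential encoding our knowledge of what progress looks like.

def shaped_reward(task_reward: float, phi_s: float, phi_s_next: float, gamma: float = 0.99) -> float:
    return task_reward + gamma * phi_s_next - phi_s

# Even while the task reward is still 0, a step that increases the potential gets r' > 0,
# so on-policy updates see a learning signal long before the policy ever hits the reward>0 zone.
print(shaped_reward(task_reward=0.0, phi_s=0.2, phi_s_next=0.5))  # ~0.295 (> 0)
```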