In this context, by the “steering system” I mean the genetically hardcoded reward circuitry that provides intrinsic rewards when certain hardwired preconditions are met. It isn’t learned. Maybe that’s part of the confusion?
An RL agent is trained to maximize total (discounted) reward. The brain isn’t maximizing total reward, nor trying to maximize total reward, nor is evolution acting on the basis that it’ll do either of these things.
An RL agent is reinforced by reward, but unless it has already fulfilled the prophecy of a convergence guarantee, or it’s doing model-based brute-force planning to maximize reward over its time horizon, the RL agent is not actually maximizing reward, nor is it necessarily trying to maximize total reward.
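To make that concrete, here’s a minimal toy sketch (the payoffs, the greedy update rule, and the function names are all invented for illustration; this isn’t a claim about any particular RL algorithm’s guarantees, let alone about the brain): an agent whose action values are reinforced by whatever reward it happens to receive can settle into a policy that is nowhere near reward-maximizing.

```python
import random

# Toy two-armed bandit: arm 0 pays 0.1, arm 1 pays 1.0 per pull.
# A purely greedy value learner is *reinforced by* reward, but it can
# lock onto whichever arm happened to pay first and never discover
# the reward-maximizing arm.

ARM_PAYOFFS = [0.1, 1.0]

def pull(arm: int) -> float:
    """Deterministic reward, for simplicity."""
    return ARM_PAYOFFS[arm]

def run_greedy_agent(first_arm: int, steps: int = 100) -> list[float]:
    q = [0.0, 0.0]            # learned action values
    lr = 0.5                  # learning rate
    history = []
    arm = first_arm           # whichever arm it stumbles on first
    for _ in range(steps):
        r = pull(arm)
        q[arm] += lr * (r - q[arm])              # reinforcement update
        history.append(r)
        arm = max(range(2), key=lambda a: q[a])  # act greedily on learned values
    return history

# If the agent happens to try arm 0 first, its updates keep endorsing arm 0
# (q[1] never moves off its initial 0.0), so it never maximizes total reward.
print(sum(run_greedy_agent(first_arm=0)))  # ~10: stuck on the worse arm
print(sum(run_greedy_agent(first_arm=1)))  # ~100: happened to find the better arm
```

The agent in the first run was reinforced by reward on every step, and it still isn’t maximizing reward or representing “maximize reward” anywhere; it’s just executing the policy its reinforcement history installed.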
The [objective encoded by the steering system] is not [maximisation of the score assigned by the steering system], but rather [whatever behaviour the steering system tends to produce].
I don’t understand why you hold this view. We’re probably talking past each other?
E.g., suppose I just have a crude sugar-reward circuit in my brain which activates when I’m hungry and my taste buds signal my brain in the right way, and I learn to like licking real-world lollipops (because that was the only way I could stimulate the circuit during training, while my values were forming). Then the objective encoded by the reward circuit is… lollipop-licking in real life? But if I had only been exposed to chocolate during training, I would have learned to like eating chocolate. And if I had only been exposed to electrical taste-bud stimulation during training, I would have learned to like electrical stimulation.
IMO, the objective encoded by the reward circuit is the maximization of its own activations; that’s what the optimal policy does.
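Here’s a minimal sketch of that lollipop/chocolate/electrode point, with an invented “reward circuit” and a crude credit-assignment rule (none of this is meant as actual neuroscience; every name and number is made up): which values the learner ends up with depends entirely on which triggers were available during training, whereas the policy that maximizes the circuit’s own activations is just whatever stimulates the circuit most directly.

```python
def reward_circuit(hungry: bool, taste_bud_signal: bool) -> float:
    """Hardcoded, unlearned: fires iff the hardwired preconditions are met."""
    return 1.0 if (hungry and taste_bud_signal) else 0.0

# Different training environments expose the learner to different ways of
# triggering the same circuit.
TRIGGERS = {"lick_lollipop": True, "eat_chocolate": True,
            "electrode_on_tongue": True, "stare_at_wall": False}

def form_values(available_actions: list[str], steps: int = 50) -> dict[str, float]:
    """Crude value learning: credit whatever action preceded circuit activation."""
    values = {a: 0.0 for a in TRIGGERS}
    for i in range(steps):
        action = available_actions[i % len(available_actions)]
        r = reward_circuit(hungry=True, taste_bud_signal=TRIGGERS[action])
        values[action] += 0.1 * (r - values[action])
    return values

# The learned objective depends on what was available during training:
print(form_values(["lick_lollipop", "stare_at_wall"]))   # ends up valuing lollipops
print(form_values(["eat_chocolate", "stare_at_wall"]))   # ends up valuing chocolate
print(form_values(["electrode_on_tongue"]))              # ends up valuing the electrode

# Whereas the policy that literally maximizes the circuit's own activations is
# "do whatever triggers the circuit most reliably", e.g. direct stimulation,
# regardless of which values the training history happened to install.
```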
Anyways, I think it would just make more sense for me to link you to a Gdoc explaining my views. PM’d.