Very interesting. I would love to see this worked out in a toy example, where you can see that an RL agent in a grid world does not in general maximize reward, but is able to reason to do… something else. That’s the part I have the hardest time translating into a simulation: what does it mean that the agent is “thinking” about outcomes, if that is something different than running an RL algorithm?
But the essential point that humans choose not to wirehead — or in general to delay or avoid gratification — is a good one. Why do they do this? Is there any RL algorithm that would do this? If not, what sort of algorithm would?
Perhaps the clearest point here is that RL maximizes reward subject to its exploration policy. With random exploration, perhaps an RL agent is (on average) a reward-maximizing agent, but it seems likely that no successful learning organism explores randomly.
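To make the grid-world question concrete, here is a minimal sketch of the sort of simulation I have in mind. All the specifics are my own assumptions rather than anything from the post: a five-cell corridor with a small "food" reward next to the start and a large "wirehead" reward at the far end, learned by tabular Q-learning with epsilon-greedy exploration.

```python
import numpy as np

# A hypothetical 5-cell corridor, purely to make the question concrete:
#   cell 0 = small nearby reward ("food", +1, episode ends)
#   cell 4 = large distant reward ("wirehead", +10, episode ends)
#   the agent starts at cell 1; actions are 0 = left, 1 = right
N_CELLS, START, FOOD, WIREHEAD = 5, 1, 0, 4
REWARDS = {FOOD: 1.0, WIREHEAD: 10.0}

def env_step(state, action):
    """Move one cell left or right; return (next_state, reward, done)."""
    nxt = max(0, min(N_CELLS - 1, state + (1 if action == 1 else -1)))
    return nxt, REWARDS.get(nxt, 0.0), nxt in REWARDS

def train(epsilon, episodes=5000, alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_CELLS, 2))
    for _ in range(episodes):
        s, done, steps = START, False, 0
        while not done and steps < 100:
            a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s2, r, done = env_step(s, a)
            target = r + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s, steps = s2, steps + 1
    return Q

def greedy_destination(Q):
    """Follow the learned greedy policy from START and report where it ends up."""
    s, done, steps = START, False, 0
    while not done and steps < 100:
        s, _, done = env_step(s, int(np.argmax(Q[s])))
        steps += 1
    return {FOOD: "food", WIREHEAD: "wirehead"}.get(s, "nowhere")

for eps in (0.0, 0.3):
    print(f"epsilon = {eps}: greedy policy ends at the {greedy_destination(train(eps))}")
# Expected outcome: with epsilon = 0.0 the agent's first positive experience
# (the all-zero tie-break happens to send it left to the food) locks in, and it
# never even visits the wirehead cell; with epsilon = 0.3 it eventually stumbles
# onto the big reward and the learned policy heads straight for it. Same
# algorithm, same rewards: what it ends up "maximizing" depends on the
# exploration policy.
```

Of course this only shows what a standard RL algorithm does under different exploration settings; it doesn't capture an agent "thinking about" whether to wirehead, which is exactly the part I don't know how to simulate.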