Maybe interesting: I think a similar double-counting problem would appear naturally if you tried to train an RL agent in a setting where:
“Reward” is proportional to an estimate of some impartial measure of goodness.
There are multiple identical copies of your RL algorithm (including: they all use the same random seed for exploration).
In a repeated version of the calculator example (importantly: where in each iteration, you randomly decide whether the people who saw “true” get offered a bet or the people who saw “false” get offered a bet — never both), the RL algorithms would learn that, indeed:
99% of the time, they’re in the group where the calculator doesn’t make an error
and on average, when they get offered a bet, they will get more reward afterwards if they take it than if they don’t.
The reason this happens is that, when the RL agents lose money, there are fewer agents that associate negative reinforcement with having taken a bet just before; whereas whenever they gain money, there are more agents that associate positive reinforcement with having taken a bet just before. So the total amount of reinforcement is greater in the latter case, and the RL agents learn to bet (despite this losing them money on average).
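To make that mechanism concrete, here is a minimal REINFORCE-style sketch. It is not the exact setup of the calculator example above: the copy count, the 99% per-copy calculator reliability, the +1/−200 bet stakes, and using "total money change across the population" as the estimate of the impartial measure of goodness are all assumptions chosen just to exhibit the asymmetry described above.

```python
import numpy as np

rng = np.random.default_rng(0)      # shared seed: all copies explore identically
N_COPIES = 100
P_CORRECT = 0.99                    # per-copy calculator reliability (assumed)
WIN, LOSS = 1.0, -200.0             # bet stakes (assumed; negative EV for a bettor)
LEARNING_RATE = 1e-4
theta = 0.0                         # shared policy logit for "take the bet"

def p_bet(t):
    return 1.0 / (1.0 + np.exp(-t))

total_money, bet_rounds = 0.0, 0
for step in range(5_000):
    # The answer is "true"; each copy's reading is correct with prob. P_CORRECT.
    reading_correct = rng.random(N_COPIES) < P_CORRECT
    # Randomly offer the bet to the "true"-seers or the "false"-seers, never both.
    offer_to_true_seers = rng.random() < 0.5
    offered = reading_correct if offer_to_true_seers else ~reading_correct
    n_offered = int(offered.sum())
    if n_offered == 0:
        continue

    # Same policy + same seed: every offered copy samples the same action.
    take_bet = rng.random() < p_bet(theta)

    # "Impartial" reward: the total money change across the whole population.
    if take_bet:
        payoff = WIN if offer_to_true_seers else LOSS  # true-seers' readings are right
        impartial_reward = n_offered * payoff
        total_money += impartial_reward
        bet_rounds += 1
    else:
        impartial_reward = 0.0

    # REINFORCE on the shared parameter. Each offered copy contributes a gradient
    # term, and each one is reinforced by the SAME impartial reward, so winning
    # rounds (~99 bettors) push theta up far harder than losing rounds (~1 bettor)
    # push it down. This is the double counting.
    grad_logp = (1.0 - p_bet(theta)) if take_bet else -p_bet(theta)
    theta += LEARNING_RATE * n_offered * impartial_reward * grad_logp

print(f"learned P(take bet): {p_bet(theta):.3f}")                              # -> near 1
print(f"avg money per betting round: {total_money / max(bet_rounds, 1):.1f}")  # negative
```

With these made-up numbers, a winning round reinforces "bet" with roughly 99 copies each seeing a reward of about +99, while a losing round punishes it with roughly 1 copy seeing about −200, so the summed update favors betting even though the expected money change per betting round is around −50. If each copy were instead rewarded with only its own money change, this sketch learns not to bet.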