For example, if the world is symmetric in the appropriate sense about which actions get you rewarded or penalized, and you maximize expected utility instead of satisficing in some way, then the argument is wrong. I’m sure there is good literature on how to model evolution as a player, and modeling the environment shouldn’t be difficult.
I would think it would hold even in that case; why is it clearly wrong?
I may be mistaken. I tried reversing your argument, and I’ve bolded the part that doesn’t feel right.
Optimistic errors are no big deal. The agent will randomly seek behaviours that get rewarded, but as long as these behaviours are reasonably rare (and are not that bad) then that’s not too costly.
But pessimistic errors are catastrophic. The agent will systematically make sure not to fall into behaviors that incur high punishment, and will use loopholes to avoid penalties even if that results in the loss of something really good. So even if these errors are extremely rare initially, they can totally mess up my agent.
So I think that maybe there is inherently an asymmetry between reward and punishment when dealing with maximizers.
But my intuition comes from somewhere else. If the difference between pessimism and optimism is just a shift by a constant, then it ought not to matter for a utility maximizer. But your definition is about errors conditional on the actual outcome, which should perhaps behave differently.
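To spell the constant-shift point out: if pessimism differs from optimism only by subtracting a fixed constant c from every outcome’s utility, a maximizer’s choice is unchanged, since

$$\arg\max_a \mathbb{E}\left[U(a) - c\right] \;=\; \arg\max_a \left(\mathbb{E}\left[U(a)\right] - c\right) \;=\; \arg\max_a \mathbb{E}\left[U(a)\right].$$

Errors conditional on the actual outcome are not a uniform shift (they move different outcomes by different amounts), so this invariance doesn’t cover them.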
I think this part of the reversed argument is wrong:
The agent will randomly seek behaviours that get rewarded, but as long as these behaviours are reasonably rare (and are not that bad) then that’s not too costly
Even if the behaviors are very rare and have a “normal” reward, the agent will seek them out and so miss out on actually good states.
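Here is a toy numerical sketch of that point (my own illustration with made-up distributions, error rates, and error sizes, not a model from the original argument): states get Gaussian true values, a small fraction of the value estimates are pushed up (optimistic errors) or down (pessimistic errors), and an argmax agent picks the state with the highest estimate.

```python
import random

def avg_true_value(n_states=1000, error_rate=0.01, error_size=3.0,
                   optimistic=True, trials=500):
    """Average true value obtained by an agent that picks the state with the
    highest *estimated* value, when a small fraction of estimates carry a
    one-sided error."""
    total = 0.0
    for _ in range(trials):
        true_vals = [random.gauss(0.0, 1.0) for _ in range(n_states)]
        estimates = []
        for v in true_vals:
            if random.random() < error_rate:
                # optimistic error: overestimate; pessimistic error: underestimate
                estimates.append(v + error_size if optimistic else v - error_size)
            else:
                estimates.append(v)
        pick = max(range(n_states), key=lambda i: estimates[i])  # the maximizer's choice
        total += true_vals[pick]
    return total / trials

random.seed(0)
print("no errors:              ", avg_true_value(error_rate=0.0))
print("rare optimistic errors: ", avg_true_value(optimistic=True))
print("rare pessimistic errors:", avg_true_value(optimistic=False))
```

On this toy model the rare overestimated states reliably capture the argmax, so the agent’s realized value drops well below the no-error case, while equally rare underestimates barely matter, since some other genuinely good state is still available.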
But there are behaviors we always seek out: trivially, eating and sleeping.