Pessimistic errors are no big deal. The agent will randomly avoid behaviors that get penalized, but as long as those behaviors are reasonably rare (and aren’t the only way to get a good outcome) then that’s not too costly.
But optimistic errors are catastrophic. The agent will systematically seek out the behaviors that receive the high reward, and will use loopholes to avoid penalties when something actually bad happens. So even if these errors are extremely rare initially, they can totally mess up my agent.
I’d love to see someone analyze this thoroughly (or I’ll do it if there’s interest). I don’t think it’s that simple, and this seems to be the main analytical argument.
For example, if the world is symmetric in the appropriate sense with respect to which actions get rewarded or penalized, and you maximize expected utility instead of satisficing in some way, then the argument is wrong. I’m sure there is good literature on how to model evolution as a player, and modeling the environment shouldn’t be difficult.
I suspect it all comes down to modeling of outcome distributions. If there’s a narrow path to success, then both biases are harmful. If there are a lot of ways to win, and a few disasters, then optimism bias is very harmful, as it makes the agent not loss-averse enough. If there are a lot of ways to win a little, and few ways to win a lot, then pessimism bias is likely to miss the big wins, as it’s trying to avoid minor losses.
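To make this a bit more concrete, here is a toy sketch of the kind of model I have in mind (the two worlds, the error model, and all of the numbers are invented purely for illustration; in particular, planting the errors on the extreme actions is roughly a worst case, not a claim about how real predictors fail):

```python
import random

def chosen_value(true_vals, error, k=3):
    """True value of the action an argmax agent picks from biased estimates.
    'optimistic': the k worst actions are estimated as wonderful.
    'pessimistic': the k best actions are estimated as terrible."""
    order = sorted(range(len(true_vals)), key=lambda i: true_vals[i])
    est = list(true_vals)
    targets = order[:k] if error == "optimistic" else order[-k:]
    for i in targets:
        est[i] = 100.0 if error == "optimistic" else -100.0
    return true_vals[max(range(len(est)), key=lambda i: est[i])]

def average(world, error, n_actions=50, trials=5000):
    return sum(chosen_value([world() for _ in range(n_actions)], error)
               for _ in range(trials)) / trials

# Two toy worlds: many modest wins plus a few disasters,
# versus many modest wins plus a few big wins.
disasters = lambda: -10.0 if random.random() < 0.04 else random.uniform(0, 1)
big_wins = lambda: 10.0 if random.random() < 0.04 else random.uniform(0, 1)

for name, world in [("disaster world", disasters), ("big-win world", big_wins)]:
    for error in ("optimistic", "pessimistic"):
        print(f"{name:14s} {error:11s} {average(world, error):7.2f}")
```

In the disaster world the optimistically biased chooser usually walks into a disaster while the pessimistically biased one barely loses anything; in the big-win world the pessimistic bias tends to wipe out the rare big wins (and with this particular error model the optimistic bias costs them too).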
I’d really enjoy an analysis focused on your conditions (maximize vs satisfice, world symmetry) - especially what kinds of worlds and biased predictors lead satisficing to get better outcomes than optimizing.
For example, if the world is symmetric in the appropriate sense with respect to which actions get rewarded or penalized, and you maximize expected utility instead of satisficing in some way, then the argument is wrong. I’m sure there is good literature on how to model evolution as a player, and modeling the environment shouldn’t be difficult.
I would think it would hold even in that case; why is it clearly wrong?
I may be mistaken. I tried reversing your argument, and bolded the part that doesn’t feel right.
Optimistic errors are no big deal. The agent will randomly seek behaviours that get rewarded, but as long as these behaviours are reasonably rare (and are not that bad) then that’s not too costly.
But pessimistic errors are catastrophic. The agent will systematically steer away from the behaviors that receive the high punishment, and will use loopholes to avoid penalties even if that results in the loss of something really good. So even if these errors are extremely rare initially, they can totally mess up my agent.
So I think that maybe there is inherently an asymmetry between reward and punishment when dealing with maximizers.
But my intuition comes from somewhere else. If the difference between pessimism and optimism were just a constant shift, then it ought not to matter for a utility maximizer. But your definition is about errors conditional on the actual outcome, which should perhaps behave differently.
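To spell out why I think those two cases differ (this is my own formalization, not anything from the post): for any constant $c$,

$$\arg\max_a \big(\mathbb{E}[U(o) \mid a] + c\big) \;=\; \arg\max_a \mathbb{E}[U(o) \mid a],$$

so a uniform optimistic or pessimistic offset cannot change what a maximizer does. But if the error depends on the realized outcome, i.e. the agent effectively uses $\hat{U}(o) = U(o) + \epsilon(o)$, then it ranks actions by $\mathbb{E}[U(o) + \epsilon(o) \mid a]$, and that can reorder actions whenever $\epsilon$ is correlated with which outcomes a given action makes likely.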
I think this part of the reversed argument is wrong:
The agent will randomly seek behaviours that get rewarded, but as long as these behaviours are reasonably rare (and are not that bad) then that’s not too costly
Even if the behaviors are very rare, and have a “normal” reward, the agent will still seek them out and so miss out on actually good states.
I think that the intuition for this argument comes from something like gradient ascent under an approximate utility function. The agent will spend most of its time near what it perceives to be a local(ish) maximum.
So I suspect the argument here is that Optimistic Errors have a better chance of locking into a single local maximum or strategy, which gets reinforced enough (or not punished enough), even though it is bad overall.
Pessimistic Errors are ones in which the agent strategically avoids locking into maxima, perhaps by Hedonic Adaptation as Dagon suggested. This may miss big opportunities if there are genuinely big maxima in the territory, but that may not be as bad (from a satisficer’s point of view, at least).
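A minimal sketch of that lock-in intuition (entirely a toy of my own: a four-armed bandit in which the reward signal the agent learns from is systematically wrong about one arm):

```python
import random

def run_bandit(true_means, observed_means, steps=5000, eps=0.1):
    """Epsilon-greedy agent learning from a biased reward signal.
    Returns average *true* utility per step and the arm it ends up preferring."""
    n = len(true_means)
    est = [0.0] * n
    counts = [0] * n
    total_true = 0.0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(n)                     # occasional exploration
        else:
            arm = max(range(n), key=lambda i: est[i])     # otherwise greedy
        reward = random.gauss(observed_means[arm], 1.0)   # what the agent sees
        counts[arm] += 1
        est[arm] += (reward - est[arm]) / counts[arm]     # running mean of observations
        total_true += true_means[arm]                     # what actually matters
    return total_true / steps, max(range(n), key=lambda i: est[i])

true_means = [0.0, 0.3, 0.6, 1.0]            # arm 3 is genuinely best

# Optimistic error: the mediocre arm 1 looks wonderful to the agent.
optimistic_signal = [0.0, 2.0, 0.6, 1.0]
# Pessimistic error: the genuinely best arm 3 looks terrible to the agent.
pessimistic_signal = [0.0, 0.3, 0.6, -1.0]

for name, signal in [("optimistic", optimistic_signal),
                     ("pessimistic", pessimistic_signal)]:
    avg_true, favourite = run_bandit(true_means, signal)
    print(f"{name}: prefers arm {favourite}, average true utility {avg_true:.2f}")
```

The over-rewarded mediocre arm gets reinforced every time it is tried, so the agent parks on it; the under-rewarded good arm just gets avoided, and the agent settles on the second-best option, which is the milder, satisficer-ish failure described above.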
Even if the behaviors are very rare, and have a “normal” reward, the agent will still seek them out and so miss out on actually good states.
But there are behaviors we always seek out. Trivially, eating and sleeping.
If this locking-in story is right, it seems more like a difference in exploration/exploitation strategies.
We do have positively valenced heuristics for exploration, such as curiosity and excitement.
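For what it’s worth, optimism itself is a standard exploration heuristic in reinforcement learning: a purely greedy agent with optimistically initialized value estimates keeps trying things until the optimism wears off, while one with “realistic” initial estimates tends to stick with the first arm that looks acceptable. A tiny sketch with made-up numbers:

```python
import random

def greedy_bandit(init, true_means=(0.2, 0.5, 0.8), steps=1000, alpha=0.1):
    """Purely greedy agent with constant-step-size value updates.
    The only knob is how optimistic its initial estimates are."""
    n = len(true_means)
    est = [init] * n
    pulls = [0] * n
    for _ in range(steps):
        arm = max(range(n), key=lambda i: est[i])    # always exploit the estimates
        reward = random.gauss(true_means[arm], 0.5)
        est[arm] += alpha * (reward - est[arm])      # exponential moving average of rewards
        pulls[arm] += 1
    return pulls

print("realistic init (0.0):", greedy_bandit(0.0))   # usually never tries the best arm
print("optimistic init (5.0):", greedy_bandit(5.0))  # tries everything, then settles on it
```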