(To summarize my upcoming point in tl;dr form: if you don’t find yourself rationalizing “maybe I’m onto the pattern” while your stomach rumbles as you contemplate the upside of getting gummi bears marginally more often, you might be tickling a different variety-seeking mechanism than you think. Nothing wrong with that, but if you want to get really good at optimizing that tickle, detailed knowledge about which mechanism it is might be helpful.)
From time to time when reading technical articles related to effective strategies for artificial agents faced with “n-armed bandit” problems, I am reminded of observed animal behavior patterns like PREE, and wonder how close the correspondence might be. N-armed bandits have been studied for a long time, and it seems like an obvious conjecture, but I have never seen much analysis of this. I never encounter such analysis spontaneously when people talk about a particular psych observation, even at ML-friendly sites like LW. And hunting for it with e.g. Google “partial reinforcement n-armed bandit” suggests that it must be a pretty obscure topic, because in the articles I find, the analysis I am looking for is swamped by different topics like reinforcement learning, and obscure topics like how a web designer trying to optimize humans’ response to the website can usefully think of his website A/B testing as an n-armed bandit problem.
Can anyone recommend systematic attempts to explore how close this correspondence might be?
Of the usual pop psychology examples of overresponse to partial reinforcement, it looks to me as though gambling truly is narrowly tuned to the PREE phenomenon, and is working essentially by fooling an agent designed to solve a bandit problem. Other examples, however, tend to be sufficiently ambiguous or contradictory in various ways that I think something unrelated could be going on. Humans can respond to variety in all sorts of positive ways. E.g., (dammit, I’m going blank on the name of) the classic confounding effect in industrial productivity studies where change itself, in either direction, can easily cause a positive effect independent of whether the new situation is objectively better in any useful sense.
Notice that successful gambling operations are contrived so that if you ever did discover even a small pattern (e.g., 53% success instead of 48%) it follows by perfectly correct analysis that the discovery would be enormously valuable. Under such extreme conditions, even a small nudge from a simple n-armed bandit heuristic (like a nagging intuition corresponding to a high prior probability that high variance implies a significant probability of discovering something that improves performance by a mere 10%) can get amplified to dramatically wrong behavior. Also notice that there is a strong observed pattern of compulsive gamblers fooling themselves into thinking they are finding small patterns. If gambling were a case of partial reinforcement directly tickling purely subconscious deep structures unrelated to n-armed banditry, then “I’m onto the pattern” might still sometimes be used to rationalize the irrational behavior, but it’s not clear why it would be a strongly favored rationalization (compared to, e.g., “risk taking makes me glamorous”).
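To put rough numbers on the 53%-versus-48% example, here is a minimal sketch. It assumes even-money bets (win or lose exactly the stake), which is my simplification and not something real casino games guarantee, and it uses the standard Kelly stake for such a bet. The function names are made up for illustration.

```python
import math


def edge_per_unit(p_win: float) -> float:
    """Expected profit per unit staked on an even-money bet (assumed payoff)."""
    return p_win * 1.0 - (1.0 - p_win) * 1.0


def kelly_fraction(p_win: float) -> float:
    """Kelly stake as a fraction of bankroll for an even-money bet.

    Returns 0 when there is no positive edge, i.e. the right move is not to bet.
    """
    return max(0.0, 2.0 * p_win - 1.0)


def log_growth_per_bet(p_win: float, fraction: float) -> float:
    """Expected log growth of the bankroll per bet when staking `fraction` of it."""
    return (p_win * math.log(1.0 + fraction)
            + (1.0 - p_win) * math.log(1.0 - fraction))


for p in (0.48, 0.53):
    stake = kelly_fraction(p)
    print(f"p_win={p:.2f}  edge per unit={edge_per_unit(p):+.3f}  "
          f"Kelly stake={stake:.3f}  "
          f"log growth per bet={log_growth_per_bet(p, stake):+.5f}")
```

Under these assumptions, moving from 48% to 53% flips the expected result per unit staked from about -4% to about +6%, and the compounding growth rate goes from "don't play at all" to positive. That sign flip is the sense in which a small discovered pattern would be enormously valuable, and so the sense in which a bandit-style heuristic is not crazy to be tempted by it.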
Compare this to behavior patterns that aren’t observed. E.g., I’ve never heard of anyone making cigarettes qualitatively more addictive by making them unpredictable, e.g., by selling mixed packs of placebo and nicotinic cigarettes. Could this be because there’s no way for the robot to get rationally excited about the enormous upside of spotting a small pattern in such randomness? (And anything close to this which does succeed, e.g. toy prizes in cereal boxes, tends to be successful for only about as long as the robot’s inputs from the world model let it be strongly uncertain about the upside.)
People do claim to spot partial-reinforcement-related phenomena in other behavior patterns which can’t easily be explained as a bandit problem heuristic being tricked. E.g., people often accuse World of Warcraft and similar games of manipulating the intermittent reward mechanism to cause addictive behavior. WoW treasures are indeed randomized, and people do indeed become fascinated by the game, and I don’t see how the robot could be getting excited about a huge upside of spotting the pattern. But WoW is in the entertainment industry. WoW developers could have saved a substantial amount of money by hiring far fewer artists to create far fewer kinds of trees and other decorations, but that would be a bad idea. Hollywood could save even more money by aggressively reusing sets and actors and props and scripts between movies. Even in extremes like soap operas, where many customers are looking for repetitive, essentially predictable escape, a successful entertainment product benefits from many kinds of variety. It seems to me that the positive importance of randomizing treasures needn’t be explained by partial reinforcement any more than the positive importance, in a soap opera in which villains walk onto the stage in hundreds of different episodes, of avoiding a clear pattern of villains entering stage right every single time.