I agree that much of the jittering merely reflects the absence of reward shaping to penalize energy expenditure or wear-and-tear on equipment (the latter especially is why in robotics they do tend to add in tiny penalties on actions/changes to encourage smoothness). And when an agent learns tactics which depend on ultra-rapid fluctuations, well, that’s usually ‘a feature, not a bug’, assuming the environment is faithful to the intended application.
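To make that concrete, here is a minimal sketch of the sort of tiny penalty I mean, assuming a continuous action vector and a generic step loop; the coefficients, the toy actions, and the `shaped_reward` helper are all made up for illustration, not taken from any particular codebase:

```python
import numpy as np

def shaped_reward(env_reward, action, prev_action,
                  energy_coef=0.001, smooth_coef=0.01):
    """Subtract small penalties for energy use and for changing actions,
    nudging the agent toward smooth behavior without swamping the task reward."""
    energy_penalty = energy_coef * float(np.sum(np.square(action)))
    smoothness_penalty = smooth_coef * float(np.sum(np.square(action - prev_action)))
    return env_reward - energy_penalty - smoothness_penalty

# Toy usage: an agent that flips its 2-D continuous action every step
# pays the smoothness penalty on every transition.
prev_action = np.zeros(2)
for t in range(3):
    action = np.array([1.0, -1.0]) * (-1) ** t   # deliberately jittery
    reward = shaped_reward(1.0, action, prev_action)
    print(t, round(reward, 4))
    prev_action = action
```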
But I still tend to be a little troubled when I see jittering in an agent, because it seems like it can reflect pathologies in the estimation of values or actions, and can interfere with learning by adding extraneous variation.
When an agent flips back and forth between actions which are irrelevant, that suggests the values of those actions are fluctuating rapidly, even though the state of the environment has probably changed only a little; if the agent were learning well, with robust, accurate estimation and few weird outliers or overfit estimates, you’d expect more consistency: “in state X, and X+1, and X+2, the best move is to go left”. It would be weird if a single pixel at the edge of the screen being red rather than green convinces the agent to go left—wait, now it’s one RGB shade brighter, go right—wait, it’s back, go left—wait, it’s green, go up! You expect more temporal consistency. (When I read about adversarial attacks on DRL agents, particularly the soccer example, it’s hard not to feel like there’s some connection to jittering there. There’s an analogy to “non-robust features” in image classification, as well as to the original adversarial image attacks: we have a strong intuition that jittering a few pixels should not have any effect.)
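As a rough illustration of the kind of diagnostic I have in mind, here is a sketch that probes how much a Q-function’s outputs (and its greedy action) move under single-pixel, one-shade perturbations; the random linear “Q-network” is just a stand-in for a trained agent, and nothing here reproduces any particular system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in Q-function: a random linear map from a flattened 84x84 frame
# to 4 action-values. In practice you would load a trained Q-network
# and feed it real frames.
W = rng.normal(scale=0.01, size=(4, 84 * 84))

def q(frame):
    return W @ frame.ravel()

frame = rng.random((84, 84))
q0 = q(frame)
greedy0 = int(np.argmax(q0))

# Probe: flip single pixels by one shade and see (a) how much the values
# move and (b) whether the greedy action ever changes.
max_shift, flips, trials = 0.0, 0, 1000
for _ in range(trials):
    y, x = rng.integers(0, 84, size=2)
    perturbed = frame.copy()
    perturbed[y, x] += 1.0 / 255.0   # one shade
    q1 = q(perturbed)
    max_shift = max(max_shift, float(np.max(np.abs(q1 - q0))))
    flips += int(np.argmax(q1) != greedy0)

print(f"max value shift {max_shift:.2e}, greedy-action flips {flips}/{trials}")
```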
In general, it seems like better agents do act more like humans. The hide&seek OA agents or the related DM game agents don’t seem to jitter like the original ALE DQN does; AlphaZero, for example, was noted by both Go & chess pros to play in a much more human-like way than weaker computer Go/chess systems (despite the latter also being superhuman), and I’ve collated many examples of more human-like better-performing systems under the “blessings of scale” rubric. So it seems to me that when an agent is learning clearly inhuman policies like jittering, that is a strong hint that however good it is, it could still be better.
It also seems like it’d interfere with learning: aside from the effect on exploration (jittering looks like epsilon-random exploration, about the worst kind), the more disparate the actions, the harder it is to estimate the net effect of the key actions or the environmental baseline. If you take only a few actions inside an episode, credit assignment ought to be easier. This might contribute to the previous problem through what you might call “superstitious agents”: by twitching rapidly in a particular pattern, maybe it caused the final victory? How do you know it didn’t? (It only has a very sparse set of episodes interacting with the environment to learn the difficult high-dimensional policies needed to solve potentially arbitrary environments, and those episodes are only partially under its control, highly stochastic, etc.)
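One crude way to quantify “jittering looks like epsilon-random exploration” is just the action-switch rate over a logged episode, compared against a uniform-random baseline; this is only a toy sketch of such a diagnostic, not any standard metric:

```python
import numpy as np

def switch_rate(actions):
    """Fraction of timesteps where the action differs from the previous one --
    a crude jitter diagnostic for a logged episode."""
    a = np.asarray(actions)
    return float(np.mean(a[1:] != a[:-1]))

rng = np.random.default_rng(0)
n_actions, T = 4, 500

random_actions = rng.integers(0, n_actions, size=T)                       # epsilon = 1 baseline
smooth_actions = np.repeat(rng.integers(0, n_actions, size=T // 10), 10)  # actions held ~10 steps

print("uniform-random switch rate:", round(switch_rate(random_actions), 2))
print("smoothed-policy switch rate:", round(switch_rate(smooth_actions), 2))
```

A jittery agent’s log looks much more like the first number than the second, which is part of why I suspect its credit assignment is harder than it needs to be.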
I’m still unsure whether jittering/random action would generally reflect a pathology in the trained policy or value functions. You’ve convinced me that it reveals a pathology in exploration, though.
So vis-a-vis policies: in some states, even the optimal policy is indifferent between actions. For such states, we would want a great number of hypotheses about those states to be easily available to the function approximator, because we would hopefully have preserved that pool of easily-available hypotheses from the agent’s untrained state. This probably means flipping between lots of low-certainty hypotheses as the input changes by very small amounts—and because low certainty in the hypotheses cannot be reflected as low certainty in the action, we’d get something like jitter. I’m not sure we disagree about this though, and I’m going to have to look into the adversarial RL attacks, which are new to me.
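Here is a toy illustration of the indifference-plus-argmax mechanism I mean: two actions whose true values are tied, estimates that wobble by a tiny input-dependent amount (standing in for the low-certainty hypotheses), and a deterministic greedy policy that therefore flips back and forth as the input drifts. Everything here is made up for the sake of the mechanism, not drawn from a real agent:

```python
import numpy as np

# Two actions the optimal policy is genuinely indifferent between: the
# approximator's estimates are tied up to a tiny, input-dependent wobble.
def q_values(x, eps=1e-3):
    return np.array([1.0 + eps * np.sin(40 * x),
                     1.0 + eps * np.cos(40 * x)])

# A deterministic argmax policy has no way to express "I don't care":
# as the input drifts slightly, the chosen action flips back and forth.
for x in np.linspace(0.0, 0.5, 11):
    print(f"x={x:.2f}  greedy action={int(np.argmax(q_values(x)))}")
```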
I think I agree, though, that random action no longer seems like the best way of exploring at this point, because the agent has already encountered the structure of the environment.
I’m not sure whether the best implementation of more purposeful exploration arises as a side effect of relatively simple RL training on an enormous variety of tasks (as in, maybe, the Open-Ended Learning paper), where curiosity emerges incidentally—or whether it requires the addition of special curiosity-directed modules. Which of these is the right way to get curiosity and directed exploration seems to me like a really important question at this point—but if it’s the former, then I guess we should expect sufficiently generally-trained policies to lack true indifference between actions as I described above, because the “curiosity” would manifest as low-confidence hypotheses which nevertheless tilt the policy away from actual indifference.
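For the “special curiosity-directed modules” option, the kind of thing I have in mind is a prediction-error bonus in the style of Random Network Distillation; this is only a sketch of that family with toy linear “networks”, not the published method itself:

```python
import numpy as np

class RNDBonus:
    """Sketch of an RND-style curiosity module: intrinsic reward is the error
    of a trained predictor against a fixed random target network, so it is
    large for novel observations and decays for familiar ones."""
    def __init__(self, obs_dim, feat_dim=16, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W_target = rng.normal(size=(feat_dim, obs_dim))            # fixed, never trained
        self.W_pred = rng.normal(scale=0.1, size=(feat_dim, obs_dim))   # trained online
        self.lr = lr

    def __call__(self, obs, train=True):
        err = self.W_target @ obs - self.W_pred @ obs
        if train:  # one SGD step on the squared prediction error
            self.W_pred += self.lr * np.outer(err, obs)
        return float(np.mean(err ** 2))

rng = np.random.default_rng(1)
bonus = RNDBonus(obs_dim=8)
familiar = rng.normal(size=8)
for _ in range(200):                       # revisit the same state many times
    r_familiar = bonus(familiar)
print("familiar-state bonus:", round(r_familiar, 4))
print("novel-state bonus:   ", round(bonus(rng.normal(size=8), train=False), 4))
```

The bonus shrinks for states the agent keeps revisiting and stays large for new ones, which is what would push the policy away from true indifference even where the task reward is flat.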