I’m still unsure whether jittering / random action would generally reflect pathology in a trained policy or value function. You’ve convinced me that it reveals pathology in exploration, though.
So vis-à-vis policies: in some states, even the optimal policy is indifferent between actions. For such states, we would want a great number of hypotheses about those states to be easily available to the function approximator, because we would hopefully have preserved that abundance of easily available hypotheses from the agent’s untrained state. This probably means flipping between lots of low-certainty hypotheses as the input changes by very small amounts, and because low-certainty hypotheses can’t be reflected in low-certainty actions, we’d get something like jitter. I’m not sure we disagree about this, though, and I’m going to have to look into the adversarial RL attacks, which are new to me.
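To illustrate what I mean (this is my own toy example, not anything from the papers we’ve discussed): if the learned action values are essentially tied at a state, greedy action selection flips under tiny observation perturbations, so indifference shows up as jitter rather than as any visibly “uncertain” action.

```python
# Toy sketch: a linear Q-function whose two actions are exactly tied at one
# state. Tiny observation noise flips the greedy action back and forth, so
# the agent jitters even though the underlying values barely differ.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights, chosen so that Q(s, a0) == Q(s, a1) at `state`.
W = np.array([[1.0, 0.0],   # action 0
              [0.0, 2.0]])  # action 1
state = np.array([0.4, 0.2])

for step in range(8):
    obs = state + rng.normal(scale=1e-4, size=2)  # tiny perturbation
    q = W @ obs
    # The greedy action flips even though the Q-values differ only in the
    # fourth decimal place.
    print(f"step {step}: greedy action = {int(np.argmax(q))}, Q = {np.round(q, 5)}")
```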
I think I agree, though, that random action no longer seems like the best way to explore at this point, because the agent has already encountered the structure of the environment.
I’m not sure whether the best implementation of more purposeful exploration comes as a side effect of relatively simple RL training on an enormous variety of tasks (as in, maybe, the Open Ended Learning paper), with curiosity emerging on its own, or whether it comes from adding special curiosity-directed modules. Which of these is the right way to get curiosity and directed exploration seems to me like a really important question at this point. But if it’s the former, then I guess we should expect sufficiently generally trained policies to lack true indifference between actions, as I describe above, because the “curiosity” would manifest as low-confidence hypotheses which nevertheless tilt the policy away from actual indifference.
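For concreteness, the “special module” option usually looks like an intrinsic-reward bonus. Here’s a rough sketch in the spirit of Random Network Distillation (the sizes, names, and learning rate are all illustrative, not taken from anything above): the bonus is the prediction error of a trained network against a frozen random target, so states the agent hasn’t visited much get a larger exploration bonus.

```python
# Rough RND-style sketch: intrinsic reward = prediction error against a
# frozen random target network. Frequently visited states get a small bonus;
# novel states keep a large one. All dimensions and rates are illustrative.
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim, lr = 8, 16, 1e-2

W_target = rng.normal(size=(feat_dim, obs_dim))  # fixed random target (never trained)
W_pred = rng.normal(size=(feat_dim, obs_dim))    # predictor, trained on visited states

def intrinsic_reward(obs):
    """Mean squared prediction error of the predictor vs. the frozen target."""
    err = W_pred @ obs - W_target @ obs
    return float(np.mean(err ** 2))

def update_predictor(obs):
    """One gradient step on the squared prediction error for this state."""
    global W_pred
    err = W_pred @ obs - W_target @ obs           # (feat_dim,)
    grad = 2.0 * np.outer(err, obs) / feat_dim    # d(mean err^2)/dW_pred
    W_pred -= lr * grad

familiar = rng.normal(size=obs_dim)   # a state the agent keeps visiting
novel = rng.normal(size=obs_dim)      # a state it has never seen

for _ in range(500):                  # "visit" the familiar state many times
    update_predictor(familiar)

print("bonus at familiar state:", round(intrinsic_reward(familiar), 4))
print("bonus at novel state:   ", round(intrinsic_reward(novel), 4))
```

The emergent-curiosity option has no explicit bonus like this; whatever exploration pressure exists has to live inside the policy itself, which is why I’d expect it to break true indifference between actions.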