That’s 100% true about the quote above being false for environments where the optimal strategy is stochastic, and a very good catch. I’d expect naive action-value methods to have a lot of trouble in multi-agent scenarios.
The ease with which other optimization methods (e.g., policy optimization, which directly adjusts the likelihood of different actions rather than using an estimate of the action-value function to choose actions) represent stochastic policies is one of their advantages over Q-learning, which can’t really do so. That’s probably one reason why extremely large-scale RL (e.g., Starcraft, Dota) tends to use more policy optimization (or some complicated mixture of both).
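(To make the contrast concrete, here’s a toy sketch, not part of the original exchange; the three-action setup, Q-values, and logits are made up. A policy read greedily off Q-values picks a single action per state, while a softmax-parameterized policy can spread probability across several actions.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3

# Action-value route: the acting policy is greedy over Q(s, ·), so it
# always picks a single action per state (a pure strategy).
q_values = np.array([0.1, 0.5, 0.4])        # made-up Q(s, ·) for one state
greedy_action = int(np.argmax(q_values))    # deterministic choice

# Policy-optimization route: the policy itself is a parameterized
# distribution over actions (here a softmax over made-up logits), so it
# can put probability mass on several actions at once.
logits = np.array([0.2, 1.0, 0.9])
probs = np.exp(logits - logits.max())
probs /= probs.sum()
sampled_action = int(rng.choice(n_actions, p=probs))  # stochastic choice

print("greedy (action-value) action:", greedy_action)
print("softmax policy probabilities:", probs.round(3))
print("sampled (policy-gradient-style) action:", sampled_action)
```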
Re. the bullet list, that’s a little too restrictive, at least in some places. For instance, even if an agent doesn’t know all (or even any) of the laws of physics, action-value methods can (I think provably) converge to the true values in the limit of infinite play. (After all, basic Q-learning never even tries to learn the environment’s transition function.)
I think Sutton & Barto or Bertsekas & Tsitsiklis would cover the complete criteria for Q-learning to be guaranteed to converge? Although of course, in practice, my understanding is that it’s quite rare for environments to meet all the criteria, and (sometimes!) the methods work anyhow.
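As a minimal sketch of the model-free point above: the tabular Q-learning update only ever touches sampled transitions. (The environment interface here, `reset()` and `step()` returning a state/reward/done triple, is just a hypothetical stand-in, not any particular library’s API.)

```python
import random
from collections import defaultdict

def q_learning_episode(env, q, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Run one episode of tabular Q-learning.

    `env` is a hypothetical environment with env.reset() -> state and
    env.step(action) -> (next_state, reward, done); `q` is a
    defaultdict(float) keyed by (state, action)."""
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy behavior policy over the current Q estimates.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: q[(state, a)])
        next_state, reward, done = env.step(action)
        # The update uses only this sampled transition; no transition
        # model P(s' | s, a) is ever estimated or consulted.
        best_next = 0.0 if done else max(q[(next_state, a)] for a in range(n_actions))
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

# Usage sketch: q = defaultdict(float), then call q_learning_episode(env, q, n_actions)
# repeatedly on whatever environment you have.
```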
I was sleep-deprived when I wrote that (and am as I write this), so I may be making some technical errors.
The list was supposed to be the conditions under which an optimal policy that assigns a pure strategy to every state is guaranteed to exist.
This doesn’t rule out the existence of environments that don’t meet all these criteria and nonetheless have optimal policies that assign pure strategies to some or all states. Such an optimal policy just isn’t guaranteed to exist.
(Compare: some games have pure Nash equilibria, but pure Nash equilibria are not guaranteed to exist in general; matching pennies, for example, has only a mixed equilibrium.)
That said, knowing the laws of physics/transition rules was meant to cover the class of non-stochastic environments with multiple possible state transitions from a given state and action.
(Maybe one could say that such environments are non-deterministic, but the state transitions could probably be modelled as fully deterministic if one added appropriate hidden state variables and/or allowed a state’s transition to be path-dependent.)
It’s in this sense that the agent needs to know the transition rules of the environment for pure strategies to be optimal in general.
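Here’s a toy illustration of that “hidden state” trick (entirely illustrative, not from the original comment): an environment that looks like a random walk to an observer of the visible state becomes fully deterministic once a hidden seed is folded into the state.

```python
import hashlib

def transition(visible_state, hidden_seed, action):
    """Toy illustration: a transition that looks random to an observer of
    `visible_state` alone is a deterministic function of the extended
    state (visible_state, hidden_seed) and the action."""
    # Derive a pseudo-random bit deterministically from the full state and action.
    digest = hashlib.sha256(f"{visible_state}|{hidden_seed}|{action}".encode()).digest()
    step = 1 if digest[0] % 2 else -1
    next_visible = visible_state + step      # looks like a random walk from outside
    next_hidden = hidden_seed + 1            # the hidden variable evolves deterministically
    return next_visible, next_hidden

# An agent that only observes `visible_state` experiences apparently stochastic
# transitions; one that also knew `hidden_seed` would see a fully deterministic
# environment, which is the sense in which "knowing the transition rules" matters.
```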