Epistemic Status: I don’t actually know anything about machine learning or reinforcement learning and I’m just following your reasoning/explanation.
> From each state, we can just check each possible action against the action-value function $q(s_t, a_t)$, and choose the action that returns the highest value from the action-value function. Greedy search against the action-value function for the optimal policy is thus equivalent to the optimal policy. For this reason, many algorithms try to learn the action-value function for the optimal policy.
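(Concretely, I read that as something like the sketch below, where the Q-table, its size, and the function name are all just made up for illustration.)

```python
import numpy as np

# Made-up tabular setting, purely to illustrate "greedy search against the
# action-value function": q_table[s, a] holds an estimate of q(s_t, a_t).
n_states, n_actions = 10, 4
q_table = np.zeros((n_states, n_actions))

def greedy_action(q_table, state):
    # Check every possible action in this state and return the one with
    # the highest estimated action value.
    return int(np.argmax(q_table[state]))
```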
This does not actually follow. Policies return probability distributions over actions (“strategies”), and it’s not necessarily the case that the output of the optimal policy in the current state is a pure strategy.
Mixed strategies are especially important and may be optimal in multi-agent environments (a pure Nash equilibrium may not exist, but a mixed Nash equilibrium is guaranteed to exist).
Though maybe for single-player decision making, optimal play never requires mixed strategies? For any mixed strategy, there should exist an action in that strategy’s support (the set of actions the strategy assigns positive probability to) whose expected return is at least as high as the strategy’s own? I think this is the case for deterministic environments, but I’m too tired to work out the maths right now.
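(Sketching the intuition I have in mind, in my own possibly sloppy notation: against a fixed action-value function $q$, the value of a mixed strategy $\pi(\cdot \mid s_t)$ is a convex combination of the values of the actions in its support, so it can’t exceed the best of them. This is only the one-step intuition, not a full argument.)

$$\sum_{a} \pi(a \mid s_t)\, q(s_t, a) \;\le\; \max_{a} q(s_t, a)$$

So some pure action in the support does at least as well as the mixture itself.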
IIRC randomised choice is mostly useful in multi-agent environments, in environments whose transition rules have free variables that may be sensitive to your actions (i.e. the environment itself can be profitably modelled as an agent, with the state transitions as its actions), or in environments that are otherwise non-deterministic/stochastic (including stochastic behaviour that arises from uncertainty).
So I think greedy search for the action that attains the highest value under the optimal policy’s action-value function is only equivalent to the optimal policy if:
- The environment is deterministic
- The environment is fully observable/the agent has perfect information
- The agent knows all the “laws of physics”/state transition rules of the environment
- The environment’s low-level state transitions are fixed and do not depend on the agent
(I may be missing some other criteria necessary to completely obviate mixed strategies.)
I think these conditions are actually quite strong!
You’re 100% right that the quote above is false for environments where the optimal strategy is stochastic, and that’s a very good catch. I’d expect naive action-value methods to have a lot of trouble in multi-agent scenarios.
The ease with which other optimization methods represent stochastic policies (i.e., policy optimization, which directly adjusts the likelihood of different actions rather than using an estimate of the action-value function to choose actions) is one of their advantages over Q-learning, which can’t really do so. That’s probably one reason why extremely large-scale RL (e.g., StarCraft, Dota) tends to use more policy optimization (or some complicated mixture of both).
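A rough sketch of the contrast (the numbers and names here are entirely made up for illustration): a policy read greedily off action-value estimates always picks a single action per state, while a directly parameterized policy outputs a distribution over actions and so can represent mixed strategies for free.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

# Acting greedily on action-value estimates yields exactly one action per
# state; it cannot express a genuinely mixed strategy.
q_estimates = rng.normal(size=n_actions)      # made-up q(s, .) estimates for one state
greedy_action = int(np.argmax(q_estimates))

# A directly parameterized policy (as in policy optimization) outputs a
# probability distribution over actions, so mixed strategies come for free.
logits = rng.normal(size=n_actions)           # made-up policy parameters for the same state
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax over actions
sampled_action = int(rng.choice(n_actions, p=probs))
```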
Re. the bullet list, that’s a little too restrictive, at least in some places. For instance, even if an agent doesn’t know all (or even any) of the laws of physics, action-value methods can (I think provably) converge to the true values in the limit of infinite play. (After all, basic Q-learning never even tries to learn the transition function of the environment.)
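For reference, here’s a minimal sketch of the standard tabular Q-learning update (textbook form, nothing specific to the discussion above); note that it only ever uses observed (state, action, reward, next state) samples and never represents the environment’s transition function at all.

```python
def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    # q_table is assumed to be indexable as q_table[state][action],
    # e.g. a dict of lists or a 2-D array; the names are illustrative.
    best_next = max(q_table[next_state])   # greedy bootstrap over next-state values
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
```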
I think Sutton & Barto or Bertsekas & Tsitsiklis would cover the complete criteria for Q-learning to be guaranteed to converge? Although of course in practice, my understanding is it’s quite rare for environments to meet all the criteria and (sometimes!) the methods work anyhow.
I was sleep-deprived when I wrote that and still am while writing this, so I may be making some technical errors.
The list was supposed to give conditions under which an optimal policy that assigns a pure strategy to every state is guaranteed to exist.
This doesn’t rule out the existence of environments that don’t meet all these criteria and nonetheless have optimal policies that assign pure strategies to some or all states. Such an optimal policy just isn’t guaranteed to exist.
(Some games have pure Nash equilibria, but pure Nash equilibria are not guaranteed to exist in general.)
That said, knowing the laws of physics/transition rules was meant to cover the class of non-stochastic environments with multiple possible state transitions from a given state and action.
(Maybe one could say that such environments are non-deterministic, but the state transitions could probably be modelled as fully deterministic if one added appropriate hidden state variables and/or allowed a state’s transitions to be path-dependent.)
It’s in this sense that the agent needs to know the transition rules of the environment for pure strategies to be optimal in general.