Over the past few days I’ve been reading about reinforcement learning, because I understood how to make a neural network, say, recognise handwritten digits, but I wasn’t sure at all how that could be turned into getting a computer to play Atari games. So: what I’ve learned so far. Spinning Up’s Intro to RL probably explains this better.
(Brief summary, explained properly below: The agent is a neural network which runs in an environment and receives a reward. Each parameter in the neural network is adjusted in proportion to how much increasing it would raise the probability of the actions the agent just took, scaled by how good the outcome of those actions was.)
Reinforcement learners play inside a game involving an agent and an environment. On turn $t$, the environment hands the agent an observation $o_t$, and the agent hands the environment an action $a_t$. For an agent acting in realtime, there can be sixty turns a second; this is fine.
The environment has a transition function which takes an observation-action pair $o_t a_t$ and responds with a probability distribution over the observation $o_{t+1}$ on the next timestep; the agent has a policy that takes an observation $o_t$ and responds with a probability distribution over the action $a_t$ to take.
The policy is usually written as $\pi$, and the probability that $\pi$ outputs an action $a$ in response to an observation $o$ is $\pi(a|o)$. In practice, $\pi$ is usually a neural network that takes observations as input and has actions as output (using something like a softmax layer to give a probability distribution); the parameters of this neural network are $\theta$, and the corresponding policy is $\pi_\theta$.
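To make this concrete, here’s a minimal sketch of what such a policy network might look like in PyTorch, assuming a discrete action space and a flat observation vector; the layer sizes and the four-observation, two-action setup are placeholders of my own, not anything prescribed above.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """A toy policy network: observation in, distribution over actions out."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.Tanh(),
            nn.Linear(64, n_actions),  # one logit per action
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(obs)
        # Categorical applies a softmax to the logits, giving pi_theta(a|o).
        return torch.distributions.Categorical(logits=logits)

policy = Policy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(4))     # distribution over actions for one observation
action = dist.sample()            # a ~ pi_theta(.|o)
print(dist.log_prob(action))      # ln pi_theta(a|o), which the derivation below needs
```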
At the end of the game, the entire trajectory $\tau = o_1 a_1 o_2 a_2 \ldots o_T a_T$ is assigned a score, $R(\tau)$, measuring how well the agent has done. The goal is to find the policy $\pi_\theta$ that maximises this score in expectation.
Since we’re using machine learning to maximise, we should be thinking of gradient ascent (gradient descent with the sign flipped), which involves finding the local direction in which to change the parameters $\theta$ in order to increase the expected value of $R$ by the greatest amount, and then moving them slightly in that direction.
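As a reminder of what gradient ascent itself looks like, here is a toy sketch (my own, not part of the derivation) that climbs a made-up objective $f(\theta) = -(\theta - 3)^2$ standing in for the expected score; everything that follows is about how to get the gradient when the objective is an expectation over trajectories.

```python
import torch

theta = torch.tensor(0.0, requires_grad=True)
for _ in range(100):
    objective = -(theta - 3.0) ** 2   # a stand-in for E[R]; we want it as large as possible
    objective.backward()              # compute d(objective)/d(theta)
    with torch.no_grad():
        theta += 0.1 * theta.grad     # step *up* the gradient
        theta.grad.zero_()
print(theta)                          # converges towards 3, the maximiser
```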
In other words, we want to find $\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$.
Writing the expectation value in terms of a sum over trajectories, this is $\nabla_\theta \sum_{\tau \in D} P(\tau|\theta)\, R(\tau) = \sum_{\tau \in D} \nabla_\theta P(\tau|\theta)\, R(\tau)$, where $P(\tau|\theta)$ is the probability of observing the trajectory $\tau$ if the agent follows the policy $\pi_\theta$, and $D$ is the space of possible trajectories.
The probability of seeing a specific trajectory happen is the product of the probabilities of the individual steps on the trajectory, and is hence $P(\tau|\theta) = \prod_{t=1}^{T} \pi_\theta(a_t|o_t)\, E(o_t|a_{t-1}o_{t-1})$, where $E(o_{t+1}|a_t o_t)$ is the probability that the environment outputs the observation $o_{t+1}$ in response to the observation-action pair $a_t o_t$. Products are awkward to work with, but products can be turned into sums by taking the logarithm: $\ln P(\tau|\theta) = \sum_{t=1}^{T} \left[ \ln \pi_\theta(a_t|o_t) + \ln E(o_t|a_{t-1}o_{t-1}) \right]$.
The gradient of this is $\nabla_\theta \ln P(\tau|\theta) = \sum_{t=1}^{T} \left[ \nabla_\theta \ln \pi_\theta(a_t|o_t) + \nabla_\theta \ln E(o_t|a_{t-1}o_{t-1}) \right]$. But what the environment does is independent of $\theta$, so the environment term vanishes entirely, and we have $\nabla_\theta \ln P(\tau|\theta) = \sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t|o_t)$. The gradient of the policy's log-probability is quite easy to find: since our policy is just a neural network, we can use back-propagation.
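In code, "use back-propagation" looks something like the sketch below, which reuses the illustrative `Policy` class from the earlier sketch; the observations and actions are random placeholders standing in for one recorded trajectory.

```python
import torch

policy = Policy(obs_dim=4, n_actions=2)             # the toy policy class from above
observations = torch.randn(10, 4)                   # o_1 ... o_T (placeholders)
actions = torch.randint(0, 2, (10,))                # a_1 ... a_T (placeholders)

log_probs = policy(observations).log_prob(actions)  # ln pi_theta(a_t|o_t) for each t
log_prob_traj = log_probs.sum()                     # the theta-dependent part of ln P(tau|theta)
log_prob_traj.backward()                            # back-propagation fills in the gradients

for name, p in policy.named_parameters():
    print(name, p.grad.shape)                       # grad of the summed log-probability w.r.t. theta
```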
Our expression for the expectation value is just in terms of the gradient of the probability, not the gradient of the logarithm of the probability, so we’d like to express one in terms of the other.
Conveniently, the chain rule gives $\nabla_\theta \ln P(\tau|\theta) = \frac{1}{P(\tau|\theta)} \nabla_\theta P(\tau|\theta)$, so $\nabla_\theta P(\tau|\theta) = P(\tau|\theta)\, \nabla_\theta \ln P(\tau|\theta)$. Substituting this back into the original expression for the gradient gives
$$\sum_{\tau \in D} P(\tau|\theta)\, \nabla_\theta \ln P(\tau|\theta)\, R(\tau),$$
and substituting our expression for the gradient of the logarithm of the probability gives
$$\sum_{\tau \in D} P(\tau|\theta) \sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t|o_t)\, R(\tau).$$
Notice that this is the definition of the expectation value of $\sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t|o_t)\, R(\tau)$ over trajectories drawn from $\pi_\theta$, so writing the sum as an expectation value again we get
$$\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t|o_t)\, R(\tau) \right].$$
You can then estimate this expectation value by sampling a large number of trajectories (by running the agent in the environment many times), calculating the term inside the brackets for each trajectory, and then averaging over all of the runs.
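Putting the pieces together gives the simplest policy-gradient method (essentially the classic REINFORCE algorithm). The sketch below reuses the illustrative `Policy` class from above and assumes the Gymnasium CartPole environment purely as a convenient stand-in; the batch size, learning rate and iteration count are arbitrary. Rather than assembling the gradient by hand, it builds the scalar $-\frac{1}{N}\sum_{\tau}\sum_{t} \ln \pi_\theta(a_t|o_t)\, R(\tau)$ and lets back-propagation plus a stock optimiser take the ascent step.

```python
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
policy = Policy(obs_dim=4, n_actions=2)   # the toy policy class from above
# Minimising the negated objective with a standard optimiser = stepping up the gradient.
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-2)

for iteration in range(50):
    batch_terms = []
    for _ in range(16):                   # sample a batch of trajectories
        obs, _ = env.reset()
        log_probs, rewards = [], []
        done = False
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))        # ln pi_theta(a_t|o_t)
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        R = sum(rewards)                                   # R(tau): the trajectory's total score
        batch_terms.append(torch.stack(log_probs).sum() * R)

    loss = -torch.stack(batch_terms).mean()  # minus the quantity whose gradient we want
    optimiser.zero_grad()
    loss.backward()                          # the sampled policy-gradient estimate
    optimiser.step()
```

On CartPole the total score per episode should creep upwards over the iterations, which is the whole point of the exercise.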
Neat!
(More sophisticated RL algorithms apply various transformations to the reward so that the information in each trajectory is used more efficiently, and apply various gradient-descent tricks so that the gradients acquired converge on good parameters more quickly.)