Over the past few days I’ve been reading about reinforcement learning, because I understood how to make a neural network, say, recognise handwritten digits, but I wasn’t sure at all how that could be turned into getting a computer to play Atari games. So: what I’ve learned so far. Spinning Up’s Intro to RL probably explains this better.
(Brief summary, explained properly below: The agent is a neural network which runs in an environment and receives a reward. Each parameter in the neural network is adjusted in proportion to how much increasing it would raise the probability of the actions the agent just took, scaled by how good the outcome of those actions was.)
Reinforcement learners play inside a game involving an agent and an environment. On turn $t$, the environment hands the agent an observation $o_t$, and the agent hands the environment an action $a_t$. For an agent acting in realtime, there can be sixty turns a second; this is fine.
The environment has a transition function which takes an observation-action pair $o_t a_t$ and responds with a probability distribution over the observation $o_{t+1}$ on the next timestep; the agent has a policy that takes an observation $o_t$ and responds with a probability distribution over the action $a_t$ to take.
The policy is usually written as $\pi$, and the probability that $\pi$ outputs an action $a$ in response to an observation $o$ is $\pi(a|o)$. In practice, $\pi$ is usually a neural network that takes observations as input and has actions as output (using something like a softmax layer to give a probability distribution); the parameters of this neural network are $\theta$, and the corresponding policy is $\pi_\theta$.
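To make this concrete, here’s a minimal sketch of what such a policy network might look like in PyTorch, assuming a discrete action space and a flat observation vector; the layer sizes and the four-observation, two-action setup are placeholders of my own, not anything prescribed above.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """A toy policy network: observation in, distribution over actions out."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.Tanh(),
            nn.Linear(64, n_actions),  # one logit per action
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(obs)
        # Categorical applies a softmax to the logits, giving pi_theta(a|o).
        return torch.distributions.Categorical(logits=logits)

policy = Policy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(4))     # distribution over actions for one observation
action = dist.sample()            # a ~ pi_theta(.|o)
print(dist.log_prob(action))      # ln pi_theta(a|o), which the derivation below needs
```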
At the end of the game, the entire trajectory $\tau = o_1 a_1 o_2 a_2 \ldots o_T a_T$ is assigned a score, $R(\tau)$, measuring how well the agent has done. The goal is to find the policy $\pi_\theta$ that maximises this score in expectation.
Since we’re using machine learning to maximise, we should be thinking of gradient ascent (gradient descent with the sign flipped), which involves finding the local direction in which to change the parameters $\theta$ in order to increase the expected value of $R$ by the greatest amount, and then moving them slightly in that direction.
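As a reminder of what gradient ascent itself looks like, here is a toy sketch (my own, not part of the derivation) that climbs a made-up objective $f(\theta) = -(\theta - 3)^2$ standing in for the expected score; everything that follows is about how to get the gradient when the objective is an expectation over trajectories.

```python
import torch

theta = torch.tensor(0.0, requires_grad=True)
for _ in range(100):
    objective = -(theta - 3.0) ** 2   # a stand-in for E[R]; we want it as large as possible
    objective.backward()              # compute d(objective)/d(theta)
    with torch.no_grad():
        theta += 0.1 * theta.grad     # step *up* the gradient
        theta.grad.zero_()
print(theta)                          # converges towards 3, the maximiser
```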
In other words, we want to find $\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$.
Writing the expectation value in terms of a sum over trajectories, this is $\nabla_\theta \sum_{\tau \in D} P(\tau|\theta)\, R(\tau) = \sum_{\tau \in D} \nabla_\theta P(\tau|\theta)\, R(\tau)$, where $P(\tau|\theta)$ is the probability of observing the trajectory $\tau$ if the agent follows the policy $\pi_\theta$, and $D$ is the space of possible trajectories.
The probability of seeing a specific trajectory happen is the product of the probabilities of the individual steps on the trajectory, and is hence $P(\tau|\theta) = \prod_{t=1}^{T} \pi_\theta(a_t|o_t)\, E(o_t|a_{t-1}o_{t-1})$, where $E(o_{t+1}|a_t o_t)$ is the probability that the environment outputs the observation $o_{t+1}$ in response to the observation-action pair $a_t o_t$. Products are awkward to work with, but products can be turned into sums by taking the logarithm: $\ln P(\tau|\theta) = \sum_{t=1}^{T} \left[ \ln \pi_\theta(a_t|o_t) + \ln E(o_t|a_{t-1}o_{t-1}) \right]$.
The gradient of this is $\nabla_\theta \ln P(\tau|\theta) = \sum_{t=1}^{T} \left[ \nabla_\theta \ln \pi_\theta(a_t|o_t) + \nabla_\theta \ln E(o_t|a_{t-1}o_{t-1}) \right]$. But what the environment does is independent of $\theta$, so the environment term vanishes entirely, and we have $\nabla_\theta \ln P(\tau|\theta) = \sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t|o_t)$. The gradient of the policy's log-probability is quite easy to find: since our policy is just a neural network, we can use back-propagation.
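In code, "use back-propagation" looks something like the sketch below, which reuses the illustrative `Policy` class from the earlier sketch; the observations and actions are random placeholders standing in for one recorded trajectory.

```python
import torch

policy = Policy(obs_dim=4, n_actions=2)             # the toy policy class from above
observations = torch.randn(10, 4)                   # o_1 ... o_T (placeholders)
actions = torch.randint(0, 2, (10,))                # a_1 ... a_T (placeholders)

log_probs = policy(observations).log_prob(actions)  # ln pi_theta(a_t|o_t) for each t
log_prob_traj = log_probs.sum()                     # the theta-dependent part of ln P(tau|theta)
log_prob_traj.backward()                            # back-propagation fills in the gradients

for name, p in policy.named_parameters():
    print(name, p.grad.shape)                       # grad of the summed log-probability w.r.t. theta
```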
Our expression for the expectation value is just in terms of the gradient of the probability, not the gradient of the logarithm of the probability, so we’d like to express one in terms of the other.
Conveniently, the chain rule gives $\nabla_\theta \ln P(\tau|\theta) = \frac{1}{P(\tau|\theta)} \nabla_\theta P(\tau|\theta)$, so $\nabla_\theta P(\tau|\theta) = P(\tau|\theta)\, \nabla_\theta \ln P(\tau|\theta)$. Substituting this back into the original expression for the gradient gives
$$\sum_{\tau \in D} P(\tau|\theta)\, \nabla_\theta \ln P(\tau|\theta)\, R(\tau),$$
and substituting our expression for the gradient of the logarithm of the probability gives
$$\sum_{\tau \in D} P(\tau|\theta) \sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t|o_t)\, R(\tau).$$
Notice that this is the definition of the expectation value of $\sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t|o_t)\, R(\tau)$ over trajectories drawn from $\pi_\theta$, so writing the sum as an expectation value again we get
$$\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=1}^{T} \nabla_\theta \ln \pi_\theta(a_t|o_t)\, R(\tau) \right].$$
You can then estimate this expectation value by sampling a large number of trajectories (by running the agent in the environment many times), calculating the term inside the brackets for each trajectory, and then averaging over all of the runs.
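Putting the pieces together gives the simplest policy-gradient method (essentially the classic REINFORCE algorithm). The sketch below reuses the illustrative `Policy` class from above and assumes the Gymnasium CartPole environment purely as a convenient stand-in; the batch size, learning rate and iteration count are arbitrary. Rather than assembling the gradient by hand, it builds the scalar $-\frac{1}{N}\sum_{\tau}\sum_{t} \ln \pi_\theta(a_t|o_t)\, R(\tau)$ and lets back-propagation plus a stock optimiser take the ascent step.

```python
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
policy = Policy(obs_dim=4, n_actions=2)   # the toy policy class from above
# Minimising the negated objective with a standard optimiser = stepping up the gradient.
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-2)

for iteration in range(50):
    batch_terms = []
    for _ in range(16):                   # sample a batch of trajectories
        obs, _ = env.reset()
        log_probs, rewards = [], []
        done = False
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))        # ln pi_theta(a_t|o_t)
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        R = sum(rewards)                                   # R(tau): the trajectory's total score
        batch_terms.append(torch.stack(log_probs).sum() * R)

    loss = -torch.stack(batch_terms).mean()  # minus the quantity whose gradient we want
    optimiser.zero_grad()
    loss.backward()                          # the sampled policy-gradient estimate
    optimiser.step()
```

On CartPole the total score per episode should creep upwards over the iterations, which is the whole point of the exercise.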
Neat!
(More sophisticated RL algorithms apply various transformations to the reward so that the information in each trajectory is used more efficiently, and apply various gradient-descent tricks so that the gradients acquired converge on good parameters more quickly.)