Pretty interesting! Since the world of Pong isn’t very rich, it would have been nice to see artificial data (e.g. moving the paddle to miss the ball by an increasing amount) to check whether things generalize the way expected reward should. Also, I found the GIFs a little hard to follow; stills might have been easier (maybe annotated with “paddle misses the ball here” or whatever).
If the policy network is representing a loss function internally, wouldn’t you expect it to actually be in the middle, rather than in the last layer?
In the course of this project, have you thought of any clever ideas for searching for search/value-features that would also work for single-player or nonzero-sum games?
Thanks for your comment! Re: artificial data, agreed that would be a good addition.
Sorry about the GIFs; maybe I should have embedded YouTube videos instead.
Re: middle layer, we actually did probe the middle layers, but the “which side the ball is on / which side the ball is approaching” features are really salient here.
Re: single player, yes, Robert had some thoughts about it, but the multiplayer setting ended up occupying us until the end of the SPAR cohort. I’ll share his notes in a separate comment.
We are given a near-optimal policy trained on an MDP. We start with simple gridworlds and scale up to more complex environments like Breakout. For evaluation against a learned value function, we will consider actor-critic agents, like those trained with PPO. Our goal is to find activations within the policy network that accurately predict the true value. The following steps are described in terms of the state-value function, but could be performed analogously to predict q-values. Note that this problem is very similar to offline reinforcement learning with pretraining, and could thus benefit from the related literature.
To start, we sample multiple datasets of trajectories (including rewards) by letting the policy and m noisy versions of it interact with the environment.
Compute activations for each state in the trajectories.
Normalise and project the respective activations to m+1 value estimates, one for the policy and one for each of its noisy versions: $v_{\theta_i}(\tilde{\phi}) = \tanh(\theta_i^{T} \tilde{\phi} + b_i)$ with $\tilde{\phi}(s) = \frac{\phi(s) - \mu}{\sigma}$.
Calculate a consistency loss to be minimised, built from some of the following terms (a minimal implementation sketch is given after this list):
(a) Mean squared TD error: $L_a(\theta) \propto \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big[ v_\theta(\tilde{\phi}(s^n_t)) - \big( R^n_t + \mathbb{1}[t < T_n]\, \gamma\, v_\theta(\tilde{\phi}(s^n_{t+1})) \big) \big]^2$. This term enforces consistency with the Bellman expectation equation. However, in addition to the value function, it depends on the use of true reward “labels”.
(b) Mean squared error between probe values and trajectory returns: $L_b(\theta) \propto \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big[ v_\theta(\tilde{\phi}(s^n_t)) - G^n_t \big]^2$. This term enforces the definition of the value function, namely that it is the expected cumulative reward of the (partial) trajectory. Using this term might be more stable than (a) since it avoids the recurrence relation.
(c) Negative variance of probe values: $L_c(\theta) \propto -\sum_{n=1}^{N} \sum_{t=1}^{T_n} \big[ v_\theta(\tilde{\phi}(s^n_t)) - \bar{v}_\theta \big]^2$. This term can help to avoid degenerate loss minimisers, e.g. in the case of sparse rewards.
(d) Enforce inequalities between different policy values using learned slack variables: $L_d(\theta, \theta_i, \lambda_i) \propto \sum_{s} \sum_{i \in \{1, \dots, m\}} \big( v_\theta(s) - v_{\theta_i}(s) - \sigma_{\lambda_i}(s)^2 \big)^2$. This term ensures that the policy consistently dominates its noisy versions and is completely unsupervised.
Train the linear probes using the training trajectories.
Evaluate on held-out test trajectories by comparing the probed value function to the actual returns. If the action space is simple enough, use the value function to plan in the environment and compare the resulting behaviour to that of the policy.
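To make the probe and the loss terms above concrete, here is a minimal PyTorch sketch. All names (ValueProbe, td_loss, return_loss, variance_bonus, dominance_loss), the discount factor, and the batching conventions are illustrative assumptions, not the project's actual implementation.

```python
# Hedged sketch of the linear value probe and the consistency losses (a)-(d).
# Names, the discount factor, and batching conventions are assumptions made
# for illustration; this is not the project's actual code.
import torch
import torch.nn as nn


class ValueProbe(nn.Module):
    """v_theta(phi_tilde) = tanh(theta^T phi_tilde + b) on normalised activations."""

    def __init__(self, activation_dim: int, mu: torch.Tensor, sigma: torch.Tensor):
        super().__init__()
        self.register_buffer("mu", mu)        # per-dimension activation mean
        self.register_buffer("sigma", sigma)  # per-dimension activation std
        self.linear = nn.Linear(activation_dim, 1)

    def forward(self, phi: torch.Tensor) -> torch.Tensor:
        # phi: (batch, activation_dim) activations extracted from the policy network
        phi_tilde = (phi - self.mu) / self.sigma
        return torch.tanh(self.linear(phi_tilde)).squeeze(-1)


def td_loss(v_t, v_tp1, rewards, nonterminal, gamma=0.99):
    """(a) Mean squared TD error; `nonterminal` is 1 where t < T_n, else 0."""
    target = rewards + nonterminal * gamma * v_tp1.detach()
    return ((v_t - target) ** 2).mean()


def return_loss(v_t, returns):
    """(b) Mean squared error between probe values and observed returns G_t."""
    return ((v_t - returns) ** 2).mean()


def variance_bonus(v_t):
    """(c) Negative variance of probe values, discouraging constant probes."""
    return -((v_t - v_t.mean()) ** 2).mean()


def dominance_loss(v_policy, v_noisy, slack):
    """(d) Push v_policy >= v_noisy by fitting the gap to a squared slack variable."""
    return ((v_policy - v_noisy - slack ** 2) ** 2).mean()
```

In practice one would presumably combine the chosen terms with tunable weights and train one probe per layer of interest; the evaluation step above then compares the resulting value estimates against held-out returns.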
Thanks for the reply! I feel like a loss term that uses the ground truth reward is “cheating.” Maybe one could get information from how a feature impacts behavior—but in this case it’s difficult to disentangle what actually happens from what the agent “thought” would happen. Although maybe it’s inevitable that to model what a system wants, you also have to model what it believes.