The LESS is More paper (summarized in AN #96) makes the claim that using the Boltzmann model in sparse regions of demonstration-space will lead to over-learning. I found this plausible but not obvious, so I wanted to check it myself. (Partly I got nerd-sniped, partly I do want to keep practicing my ability to tell when things are formalizable theorems.) This post benefited from discussion with Andreea (one of the primary authors).
Let’s consider a model where there are clusters $\{c_i\}$, where each cluster contains trajectories whose features are identical, $c_i = \{\tau : \phi(\tau) = \phi_{c_i}\}$ (which also implies rewards are identical). Let $c(\tau)$ denote the cluster that $\tau$ belongs to. The Boltzmann model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))}$. The LESS model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$, that is, the human chooses a cluster noisily based on the reward, and then uniformly at random chooses a trajectory from within that cluster.
(Note that the paper does something more suited to realistic situations where we have a similarity metric instead of these “clusters”; I’m introducing them as a simpler situation where we can understand what’s going on formally.)
In this model, a “sparse region of demonstration-space” is a cluster $c$ with small cardinality $|c|$, whereas a dense one has large $|c|$.
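To make the two models concrete, here is a minimal numerical sketch. The cluster sizes and rewards are toy values I made up (not from the paper); it just computes the probability each model assigns to a single trajectory from each cluster.

```python
import numpy as np

# Toy setup (my own numbers, not from the paper): three clusters,
# where cluster 0 is sparse and cluster 2 is dense.
sizes = np.array([1, 4, 10])          # |c_i|
rewards = np.array([1.0, 0.5, 0.2])   # R_theta(c_i), identical for every tau in c_i

# Boltzmann: softmax over individual trajectories, so the normalizer counts
# each trajectory separately (|c'| copies of exp(R_theta(c'))).
boltzmann = np.exp(rewards) / np.sum(sizes * np.exp(rewards))

# LESS: softmax over clusters, then a uniform choice within the chosen cluster.
less = (np.exp(rewards) / np.exp(rewards).sum()) / sizes

print(boltzmann)   # probability of any one trajectory in each cluster
print(less)        # here the sparse cluster 0 gets less probability under Boltzmann
```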
Let’s first do some preprocessing. We can rewrite the Boltzmann model as follows:
$$p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))} = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c')) \cdot |c'|} = \frac{|c(\tau)| \cdot \exp(R_\theta(c(\tau)))}{\sum_{c'} |c'| \cdot \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$
This allows us to write both models as first selecting a cluster, and then choosing randomly within the cluster:
$$p(\tau \mid \theta) = \frac{p(c(\tau)) \cdot \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$
where for LESS $p(c)$ is uniform, i.e. $p(c) \propto 1$, whereas for Boltzmann $p(c) \propto |c|$, i.e. a denser cluster is more likely to be sampled.
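As a sanity check on this rewrite, here is a small sketch (same toy numbers as above, which are my own) verifying that the unified “pick a cluster, then pick uniformly within it” form reproduces both original likelihoods once we plug in $p(c) \propto |c|$ and $p(c) \propto 1$ respectively.

```python
import numpy as np

sizes = np.array([1, 4, 10])          # |c_i| (toy values)
rewards = np.array([1.0, 0.5, 0.2])   # R_theta(c_i)

def traj_likelihood(cluster_weights, sizes, rewards):
    """p(tau|theta) = p(c(tau)) exp(R(c)) / sum_c' p(c') exp(R(c')) * 1/|c(tau)|,
    with p(c) proportional to cluster_weights."""
    w = cluster_weights * np.exp(rewards)
    return (w / w.sum()) / sizes

# p(c) proportional to |c| recovers the plain Boltzmann-over-trajectories model...
assert np.allclose(traj_likelihood(sizes, sizes, rewards),
                   np.exp(rewards) / np.sum(sizes * np.exp(rewards)))
# ...and p(c) uniform recovers LESS.
assert np.allclose(traj_likelihood(np.ones(3), sizes, rewards),
                   (np.exp(rewards) / np.exp(rewards).sum()) / sizes)
```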
So now let us return to the original claim that the Boltzmann model overlearns in sparse areas. We’ll assume that LESS is the “correct” way to update (which is what the paper is claiming); in this case the claim reduces to saying that the Boltzmann model updates the posterior over rewards in the right direction but with too high a magnitude.
The intuitive argument for this is that the Boltzmann model assigns a lower likelihood to sparse clusters, since its “prior” over sparse clusters is much smaller, and so when it actually observes this low-likelihood event, it must update more strongly. However, this argument doesn’t work: it only claims that $p_{\text{Boltzmann}}(\tau) < p_{\text{LESS}}(\tau)$, but in order to do a Bayesian update you need to consider likelihood ratios. To see this more formally, let’s look at the reward learning update:
$$p(\theta \mid \tau) = \frac{p(\theta) \cdot p(\tau \mid \theta)}{\sum_{\theta'} p(\theta') \cdot p(\tau \mid \theta')} = \frac{p(\theta) \cdot \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))}}{\sum_{\theta'} p(\theta') \cdot \frac{\exp(R_{\theta'}(c(\tau)))}{\sum_{c'} p(c') \exp(R_{\theta'}(c'))}}.$$
In the last step, any multiplicative factors in $p(\tau \mid \theta)$ that didn’t depend on $\theta$ cancelled out. In particular, the prior over the selected cluster cancelled out (though the prior did remain in the inner normalizer $\sum_{c'} p(c') \exp(R_\theta(c'))$, where it can still affect things). But the simple argument of “the prior is lower, therefore it updates more strongly” doesn’t seem to be reflected here.
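To see the cancellation numerically, here is a quick sketch (same toy numbers as above, plus two made-up reward vectors for $\theta_1$ and $\theta_2$): the posterior over $\theta$ is the same whether or not we include the $\theta$-independent factor $p(c(\tau)) \cdot \frac{1}{|c(\tau)|}$, so the cluster prior only enters through the inner normalizer.

```python
import numpy as np

sizes = np.array([1, 4, 10])
p_c = sizes / sizes.sum()                        # Boltzmann-style prior over clusters
rewards_by_theta = np.array([[1.0, 0.5, 0.2],    # R_{theta_1}(c_i) (toy values)
                             [0.2, 0.5, 1.0]])   # R_{theta_2}(c_i)
theta_prior = np.array([0.5, 0.5])
obs = 0                                          # the demo came from cluster c_0

def normalize(x):
    return x / x.sum()

# Likelihood with the theta-independent factor p(c(tau)) / |c(tau)| included...
full = p_c[obs] / sizes[obs] * np.exp(rewards_by_theta[:, obs]) \
       / (p_c * np.exp(rewards_by_theta)).sum(axis=1)
# ...and with it dropped: only the inner normalizer sum_c' p(c') exp(R_theta(c')) keeps p.
reduced = np.exp(rewards_by_theta[:, obs]) \
          / (p_c * np.exp(rewards_by_theta)).sum(axis=1)

assert np.allclose(normalize(theta_prior * full), normalize(theta_prior * reduced))
```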
Also, as you might expect, once we make the shift to thinking of selecting a cluster and then selecting a trajectory randomly, it no longer matters which trajectory you choose: the only relevant information is the cluster chosen (you can see this in the update above, where the only thing you do with the trajectory is to see which cluster $c(\tau)$ it is in). So from now on I’ll just talk about selecting clusters, and updating on them. I’ll also write $ER_\theta(c) = \exp(R_\theta(c))$ for conciseness.
$$p(\theta \mid c) = \frac{p(\theta) \cdot \frac{ER_\theta(c)}{\sum_{c'} p(c') ER_\theta(c')}}{\sum_{\theta'} p(\theta') \cdot \frac{ER_{\theta'}(c)}{\sum_{c'} p(c') ER_{\theta'}(c')}}.$$
This is a horrifying mess of an equation. Let’s switch to odds:
$$\frac{p(\theta_1 \mid c)}{p(\theta_2 \mid c)} = \frac{p(\theta_1)}{p(\theta_2)} \cdot \frac{ER_{\theta_1}(c)}{ER_{\theta_2}(c)} \cdot \frac{\sum_{c'} p(c') ER_{\theta_2}(c')}{\sum_{c'} p(c') ER_{\theta_1}(c')}$$
The first two terms are the same across Boltzmann and LESS, since those only differ in their choice of $p(c)$. So let’s consider just that last term. Denoting the vector of priors on all clusters as $\vec{p}$, and similarly the vector of exponentiated rewards as $\vec{ER}_\theta$, the last term becomes $\frac{\vec{p} \cdot \vec{ER}_{\theta_2}}{\vec{p} \cdot \vec{ER}_{\theta_1}} = \frac{|\vec{ER}_{\theta_2}|}{|\vec{ER}_{\theta_1}|} \cdot \frac{\cos(\alpha_2)}{\cos(\alpha_1)}$, where $\alpha_i$ is the angle between $\vec{p}$ and $\vec{ER}_{\theta_i}$. Again, the first of these factors doesn’t differ between Boltzmann and LESS, so the only thing that differs between the two models is the ratio $\frac{\cos(\alpha_2)}{\cos(\alpha_1)}$.
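Here is a tiny check of that dot-product decomposition, with made-up numbers for $\vec{p}$ and the exponentiated rewards:

```python
import numpy as np

p   = np.array([1, 4, 10]) / 15.0                # prior over clusters (toy, Boltzmann-style)
ER1 = np.exp(np.array([1.0, 0.5, 0.2]))          # exp(R_{theta_1}(c)) (toy values)
ER2 = np.exp(np.array([0.2, 0.5, 1.0]))          # exp(R_{theta_2}(c))

def cos_angle(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

last_term  = (p @ ER2) / (p @ ER1)
decomposed = (np.linalg.norm(ER2) / np.linalg.norm(ER1)) \
             * (cos_angle(p, ER2) / cos_angle(p, ER1))
assert np.isclose(last_term, decomposed)         # the |p| factors cancel in the ratio
```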
What happens when the chosen cluster $c$ is sparse? Without loss of generality, let’s say that $ER_{\theta_1}(c) > ER_{\theta_2}(c)$; that is, $\theta_1$ is a better fit for the demonstration, and so we will update towards it. Since $c$ is sparse, $p(c)$ is smaller for Boltzmann than for LESS, which probably means that $\vec{p}$ is better aligned with $\vec{ER}_{\theta_2}$, which also has a low value $ER_{\theta_2}(c)$ on that cluster by assumption. (However, this is by no means guaranteed.) In this case, the ratio $\frac{\cos(\alpha_2)}{\cos(\alpha_1)}$ above would be higher for Boltzmann than for LESS, and so it would more strongly update towards $\theta_1$, supporting the claim that Boltzmann would overlearn rather than underlearn when getting a demo from the sparse region.
(Note that it does make sense to analyze the effect on the $\theta$ that we update towards, because in reward learning we care primarily about the $\theta$ that we end up placing higher probability on.)
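Finally, here is a toy instance of the scenario above (all numbers are my own, chosen so that $\theta_1$ favors the sparse cluster and $\theta_2$ the dense one): starting from even prior odds, a demo from the sparse cluster moves the odds towards $\theta_1$ under both models, but further under Boltzmann, consistent with the overlearning claim. As noted above, this direction isn’t guaranteed for every choice of numbers.

```python
import numpy as np

sizes = np.array([1, 4, 10])                     # cluster 0 is sparse
rewards_by_theta = np.array([[1.0, 0.5, 0.2],    # theta_1: likes the sparse cluster
                             [0.2, 0.5, 1.0]])   # theta_2: likes the dense cluster
obs = 0                                          # demo comes from the sparse cluster

def odds_update(p_c):
    """Posterior odds p(theta_1 | c) / p(theta_2 | c), starting from even prior odds."""
    lik = np.exp(rewards_by_theta[:, obs]) / (p_c * np.exp(rewards_by_theta)).sum(axis=1)
    return lik[0] / lik[1]

boltzmann_odds = odds_update(sizes / sizes.sum())   # p(c) proportional to |c|
less_odds      = odds_update(np.ones(3) / 3.0)      # p(c) uniform

print(boltzmann_odds, less_odds)   # ~3.6 vs ~2.2: both update towards theta_1, Boltzmann more
```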