Goodhart in RL with KL: Appendix

This is the appendix to the previous post on Goodhart’s Law and KL regularization, containing all of our proofs.

Theorem about distributions

Theorem 1: Given any heavy-tailed reference distribution $Q$ over $\mathbb{R}$ with mean $\mu$, and any $M \in \mathbb{R}$ and $\epsilon > 0$, there is a distribution $P$ with mean at least $M$ and $D_{\mathrm{KL}}(P \,\|\, Q) \le \epsilon$.

Proof: WLOG let $\mu = 0$. We construct a sequence of distributions $P_t$ such that for any constant $c$, $\mathbb{E}_{X \sim P_t}[X] > c$ for all sufficiently large $t$, and $\lim_{t \to \infty} D_{\mathrm{KL}}(P_t \,\|\, Q) = 0$; taking $P = P_t$ for a suitable $t$ then proves the theorem. We define $P_t$ for all sufficiently large $t$ as follows. Writing $F$ for the CDF of $Q$ and $\bar F(t)$ for $1 - F(t) = Q(X > t)$, we let

$$\frac{dP_t}{dQ}(x) = \begin{cases} \dfrac{1 - s(t)}{F(t)} & \text{if } x \le t \\[4pt] \dfrac{s(t)}{\bar F(t)} & \text{if } x > t \end{cases} \qquad \text{where } s(t) = \frac{1}{\sqrt{t \log\!\big(1/\bar F(t)\big)}}.$$

Intuitively, we rescale the part of the distribution to the right of $t$ evenly to have total probability $s(t)$, which is less than 1 for all sufficiently large $t$ because $t \log(1/\bar F(t)) \to \infty$, and rescale the part to the left of $t$ evenly to have the remaining probability $1 - s(t)$. (Note that $\bar F(t) > 0$ for every $t$ because $Q$ is heavy-tailed, so $s(t)$ is well defined.)
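To make the construction concrete, here is a small numerical sketch (not part of the proof). It assumes $Q$ is a standard lognormal shifted to have mean zero (heavy-tailed with finite mean) and uses the particular choice of $s(t)$ above; it prints the mean and the KL divergence of $P_t$ as $t$ grows, so you can watch the mean blow up while the KL divergence shrinks.

```python
# Numerical sketch of the Theorem 1 construction (illustration only, not part of the proof).
# Assumed setup: Q is a standard lognormal shifted to mean 0, and s(t) = 1/sqrt(t*log(1/Fbar(t))).
import numpy as np
from scipy.stats import norm

mu = np.exp(0.5)                      # mean of the standard lognormal; Q is X = L - mu

def Fbar(t):                          # Q(X > t) = P(L > t + mu)
    return norm.sf(np.log(t + mu))

def partial_mean_above(t):            # E_Q[X * 1{X > t}], via the lognormal partial expectation
    a = t + mu
    return mu * (norm.sf(np.log(a) - 1.0) - norm.sf(np.log(a)))

for t in [2.0, 10.0, 100.0, 1e4, 1e6]:
    fb = Fbar(t)
    s = 1.0 / np.sqrt(t * np.log(1.0 / fb))
    above = partial_mean_above(t)                 # E_Q[X 1{X > t}]
    below = -above                                # E_Q[X 1{X <= t}], since E_Q[X] = 0
    mean_Pt = (1 - s) / (1 - fb) * below + s / fb * above
    kl = (1 - s) * np.log((1 - s) / (1 - fb)) + s * np.log(s / fb)
    print(f"t={t:>9.0f}  s(t)={s:.2e}  E_Pt[X]={mean_Pt:9.3f}  KL(P_t||Q)={kl:.4f}")
```

The trend is slow (the mean only grows roughly like $\sqrt{t/\log(1/\bar F(t))}$), but both limits go the right way: the mean increases without bound while the KL divergence decreases toward zero.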

We must check that $\mathbb{E}_{X \sim P_t}[X] \to \infty$. We can write

$$\mathbb{E}_{X \sim P_t}[X] = (1 - s(t))\,\mathbb{E}_Q[X \mid X \le t] + s(t)\,\mathbb{E}_Q[X \mid X > t].$$

We know that $\mathbb{E}_Q[X \mid X > t] > t$ because it is an average of values strictly greater than $t$. Because $\mathbb{E}_Q[X] = 0$ is a weighted average of $\mathbb{E}_Q[X \mid X \le t]$ and $\mathbb{E}_Q[X \mid X > t]$, namely $0 = F(t)\,\mathbb{E}_Q[X \mid X \le t] + \bar F(t)\,\mathbb{E}_Q[X \mid X > t]$, and $\mathbb{E}_Q[X \mid X > t] > t > 0$, we know $\mathbb{E}_Q[X \mid X \le t] < 0$. So $(1 - s(t))\,\mathbb{E}_Q[X \mid X \le t] = -(1 - s(t))\,\frac{\bar F(t)}{F(t)}\,\mathbb{E}_Q[X \mid X > t] \ge -\frac{\bar F(t)}{F(t)}\,\mathbb{E}_Q[X \mid X > t]$. We also know that for sufficiently large $t$, $F(t) \ge 1/2$, so $\bar F(t)/F(t) \le 2\bar F(t)$. Intuitively, starting from $Q$, which has mean 0, $P_t$ moves a probability mass approaching $s(t)$ from mean $< 0$ to mean $> t$.

Now we can say that for all sufficiently large $t$,

$$\mathbb{E}_{X \sim P_t}[X] \;\ge\; \left(s(t) - \frac{\bar F(t)}{F(t)}\right)\mathbb{E}_Q[X \mid X > t] \;\ge\; \big(s(t) - 2\bar F(t)\big)\,t \;=\; t\,s(t) - 2\,t\,\bar F(t).$$

Because $Q$ has a finite mean, $\lim_{t\to\infty} t\,\bar F(t) = 0$; and because $Q$ is heavy-tailed, $\log(1/\bar F(t)) < \lambda t$ for every $\lambda > 0$ and all sufficiently large $t$ (this is spelled out below), so $t\,s(t) = \sqrt{t / \log(1/\bar F(t))} \to \infty$. And so $\mathbb{E}_{X \sim P_t}[X] \to \infty$: for any constant $c$, $\mathbb{E}_{X \sim P_t}[X] > c$ for all sufficiently large $t$.

Now we check that $\lim_{t\to\infty} D_{\mathrm{KL}}(P_t \,\|\, Q) = 0$:

$$D_{\mathrm{KL}}(P_t \,\|\, Q) = \mathbb{E}_{P_t}\!\left[\log \frac{dP_t}{dQ}\right] = (1 - s(t))\log\frac{1 - s(t)}{F(t)} + s(t)\log\frac{s(t)}{\bar F(t)}.$$

Since both $s(t)$ and $\bar F(t)$ go to $0$ as $t \to \infty$, the left term goes to $\log 1 = 0$, and so

$$\limsup_{t\to\infty} D_{\mathrm{KL}}(P_t \,\|\, Q) = \limsup_{t\to\infty}\, s(t)\log\frac{s(t)}{\bar F(t)} \le \limsup_{t\to\infty}\, s(t)\log\frac{1}{\bar F(t)},$$

where the last step uses $s(t)\log s(t) < 0$.

$Q$ is heavy-tailed, so by definition $\lim_{t\to\infty} e^{\lambda t}\,\bar F(t) = \infty$ for every $\lambda > 0$. This implies that for every $\lambda > 0$ there is a sufficiently large $t_\lambda$ so that for all $t > t_\lambda$, $\bar F(t) > e^{-\lambda t}$, which means that $s(t)\log\big(1/\bar F(t)\big) = \sqrt{\log\big(1/\bar F(t)\big)/t} < \sqrt{\lambda}$.
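This tail condition is easy to see numerically. The sketch below (with assumed example distributions) contrasts a standard lognormal, for which $e^{\lambda t}\,\bar F(t)$ blows up, with a light-tailed standard normal, for which it collapses to zero.

```python
# Numerical look at the tail condition: exp(lambda*t) * Q(X > t) eventually blows up for a
# standard lognormal (heavy-tailed), but collapses to zero for a standard normal (light-tailed).
import numpy as np
from scipy.stats import norm

lam = 1.0
for t in [5.0, 10.0, 20.0, 40.0]:
    lognormal_tail = norm.sf(np.log(t))        # P(L > t) for a standard lognormal L
    normal_tail = norm.sf(t)                   # P(Z > t) for a standard normal Z
    print(f"t={t:4.0f}   e^t * lognormal tail = {np.exp(lam * t) * lognormal_tail:10.3e}"
          f"   e^t * normal tail = {np.exp(lam * t) * normal_tail:10.3e}")
```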

Therefore for every $\lambda > 0$, $\limsup_{t\to\infty} D_{\mathrm{KL}}(P_t \,\|\, Q) \le \sqrt{\lambda}$, which since KL divergence is nonnegative means that $\lim_{t\to\infty} D_{\mathrm{KL}}(P_t \,\|\, Q) = 0$, as desired. Taking $t$ large enough that $\mathbb{E}_{X\sim P_t}[X] \ge M$ and $D_{\mathrm{KL}}(P_t \,\|\, Q) \le \epsilon$ completes the proof.

Theorem about deterministic Markovian-return MDPs

Definition: A deterministic-transition MDP with Markovian returns (DMRMDP) is an MDP such that:

  • The transition function is deterministic, i.e., for each state $s$ and action $a$, there exists a unique state $s'$ such that taking action $a$ in state $s$ leads to $s'$ with probability 1.

  • There is a set of sink states that terminate a trajectory, which is disjoint from the set of start states.

  • Returns are Markovian; that is, for any two trajectories $\tau_1, \tau_2$, if they end in the same state, then $\tau_1$ and $\tau_2$ have identical return distributions. Equivalently, for the trajectory random variable $\tau = (s_0, a_0, s_1, a_1, \dots)$ distributed according to any policy, with return $R$, we have $R \perp (s_0, a_0, \dots, a_{i-1}) \mid s_i$ for any $i$.

Note: Sampling from a language model and applying RLHF is well modeled as a DMRMDP, since the state is the sequence of tokens generated so far (which results deterministically from the previous state and the latest token, i.e., the action) and returns depend only on the final state.
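For concreteness, here is a minimal sketch of that correspondence. The vocabulary, termination rule, and scoring function below are invented purely for illustration; the point is that the transition just appends the chosen token and the return is read off the final state.

```python
# Minimal sketch of the DMRMDP view of sampling from a language model (hypothetical toy setup):
# states are token prefixes, the transition appends the chosen token (deterministic),
# EOS leads to a sink state, and the return is a function of the final state only.
from typing import Tuple

Token = str
State = Tuple[Token, ...]          # the sequence of tokens generated so far

EOS = "<eos>"

def transition(state: State, action: Token) -> State:
    """Deterministic transition: the next state is the prefix plus the new token."""
    return state + (action,)

def is_sink(state: State) -> bool:
    """Trajectories terminate once EOS has been emitted."""
    return len(state) > 0 and state[-1] == EOS

def reward(state: State) -> float:
    """Markovian return: only the completed sequence (the final state) is scored.
    This particular scoring rule is made up purely for illustration."""
    assert is_sink(state)
    return float(sum(len(tok) for tok in state[:-1]))

# Example rollout: the return depends only on the final state, not on the path --
# though with deterministic transitions the final state encodes the whole path anyway.
s: State = ()
for tok in ["the", "cat", EOS]:
    s = transition(s, tok)
print(s, reward(s))
```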

Theorem 2: Let $W$ be a deterministic-transition MDP with Markovian returns. Given $W$ we define the function $\mathrm{Traj}$ that takes policies $\pi$ to distributions over trajectories $\mathrm{Traj}(\pi)$, and the average return function $g(\tau) = \mathbb{E}[R \mid \tau]$, which induces a function $G$ taking each policy $\pi$ to the distribution of $g(\tau)$ for $\tau \sim \mathrm{Traj}(\pi)$. Let $\pi_{\mathrm{ref}}$ be some reference policy. If $G(\pi_{\mathrm{ref}})$ is heavy-tailed with finite mean $\mu$, then for any $M \in \mathbb{R}$ and $\epsilon > 0$, there is a policy $\pi$ with mean return at least $M$ and $D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{ref}}) \le \epsilon$.

Proof: We will exhibit a distribution $D$ over trajectories such that $\mathbb{E}_{\tau \sim D}[g(\tau)] \ge M$ and $D_{\mathrm{KL}}(D \,\|\, \mathrm{Traj}(\pi_{\mathrm{ref}})) \le \epsilon$, and then construct a policy $\pi$ with $\mathrm{Traj}(\pi) = D$. Note that this proof applies to continuous action spaces if individual trajectories are replaced with measurable sets of trajectories, but this would make it harder to read.

Let $Q = G(\pi_{\mathrm{ref}})$. We have a heavy-tailed distribution of return over $\mathbb{R}$, so we can apply Theorem 1. But to define $D$, we construct the distributions $P_t$ in the proof of Theorem 1 in a particular way. For any $t$, we need a $P_t$ that uniformly upweights values of mean return above $t$, such that $P_t$ is the distribution of $g(\tau)$ under a reweighted trajectory distribution. We can define $D_t$ such that any trajectory is upweighted by a factor depending only on its mean return:

$$D_t(\tau) = \mathrm{Traj}(\pi_{\mathrm{ref}})(\tau)\cdot w_t\big(g(\tau)\big), \qquad w_t(r) = \begin{cases} \dfrac{1 - s(t)}{F(t)} & \text{if } r \le t \\[4pt] \dfrac{s(t)}{\bar F(t)} & \text{if } r > t, \end{cases}$$

with $F$, $\bar F$, and $s(t)$ as in the proof of Theorem 1, applied to $Q$.

Then we can let $P_t$ be the distribution of $g(\tau)$ for $\tau \sim D_t$ (this is exactly the two-piece construction above applied to $Q$), and the rest of the proof of Theorem 1 applies. Therefore, applying the theorem, we can let $D = D_t$ for sufficiently large $t$, and then $\mathbb{E}_{\tau \sim D}[g(\tau)] \ge M$ and $D_{\mathrm{KL}}(g_* D \,\|\, g_* \mathrm{Traj}(\pi_{\mathrm{ref}})) \le \epsilon$, where $g_*$ denotes taking the distribution of $g(\tau)$. But by the chain rule for KL divergence,

$$D_{\mathrm{KL}}\big(D \,\|\, \mathrm{Traj}(\pi_{\mathrm{ref}})\big) = D_{\mathrm{KL}}\big(g_* D \,\|\, g_* \mathrm{Traj}(\pi_{\mathrm{ref}})\big) + \mathbb{E}_{r \sim g_* D}\Big[ D_{\mathrm{KL}}\big(D(\cdot \mid g(\tau) = r) \,\big\|\, \mathrm{Traj}(\pi_{\mathrm{ref}})(\cdot \mid g(\tau) = r)\big) \Big].$$

Since we constructed $D$ so that the probabilities of each trajectory conditional on its return being $r$ are equal under $D$ and $\mathrm{Traj}(\pi_{\mathrm{ref}})$, the second term is zero, and we also have $D_{\mathrm{KL}}(D \,\|\, \mathrm{Traj}(\pi_{\mathrm{ref}})) \le \epsilon$.

Finally, since the KL divergence between trajectory distributions is the expected sum, over the actions in a trajectory, of the KL divergences between the two policies' action distributions at the visited states, and each trajectory has at least one action, we get $D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{ref}}) \le D_{\mathrm{KL}}(D \,\|\, \mathrm{Traj}(\pi_{\mathrm{ref}})) \le \epsilon$, as desired.
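This identity is easy to verify directly on a small example. The sketch below uses a hypothetical two-step MDP with a fixed start state, two actions, and made-up policies $\pi$ and $\pi_{\mathrm{ref}}$.

```python
# Quick check (toy example, not part of the proof): with deterministic transitions, the KL
# between trajectory distributions equals the expected sum, over a trajectory drawn from pi,
# of the per-step KL between pi and pi_ref at the visited states.
import itertools
import math

ACTIONS = ["a", "b"]
HORIZON = 2
pi     = {(): {"a": 0.9, "b": 0.1}, ("a",): {"a": 0.2, "b": 0.8}, ("b",): {"a": 0.5, "b": 0.5}}
pi_ref = {(): {"a": 0.6, "b": 0.4}, ("a",): {"a": 0.7, "b": 0.3}, ("b",): {"a": 0.5, "b": 0.5}}

def traj_prob(policy, traj):
    p = 1.0
    for i in range(HORIZON):
        p *= policy[traj[:i]][traj[i]]
    return p

trajs = list(itertools.product(ACTIONS, repeat=HORIZON))

kl_traj = sum(traj_prob(pi, t) * math.log(traj_prob(pi, t) / traj_prob(pi_ref, t)) for t in trajs)

def kl_state(s):                       # per-step KL( pi(.|s) || pi_ref(.|s) )
    return sum(pi[s][a] * math.log(pi[s][a] / pi_ref[s][a]) for a in ACTIONS)

kl_sum = sum(traj_prob(pi, t) * sum(kl_state(t[:i]) for i in range(HORIZON)) for t in trajs)

print(f"{kl_traj:.6f} == {kl_sum:.6f}")
```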

To define $\pi$ such that $\mathrm{Traj}(\pi) = D$, we let $\pi(a \mid s) = \Pr_{\tau \sim D}(\text{the action taken from } s \text{ is } a \mid \tau \text{ visits } s)$ for every state $s$ visited with positive probability under $D$ (and arbitrarily elsewhere).

Then the probability that any trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_n)$ is sampled is:

$$\mathrm{Traj}(\pi)(\tau) = \prod_{i=0}^{n-1} \pi(a_i \mid s_i) \quad (1)$$

$$= \prod_{i=0}^{n-1} \Pr_{\tau' \sim D}\big(a_i \text{ is taken from } s_i \,\big|\, \tau' \text{ visits } s_i\big) \quad (2)$$

$$= \prod_{i=0}^{n-1} \Pr_{\tau' \sim D}\big(a_i \text{ is taken from } s_i \,\big|\, \tau' \text{ begins } s_0, a_0, \dots, a_{i-1}, s_i\big) \quad (3)$$

$$= D(\tau).$$

In (2) we used the definition of $\pi$. Returns are Markovian, so all trajectory prefixes ending in the state $s_i$ have the same distribution of returns under any policy; and in the construction of $D$, all trajectories with the same mean return have equal measure relative to $\mathrm{Traj}(\pi_{\mathrm{ref}})$. Therefore, conditioning on the earlier states and actions of $\tau'$ does not change the conditional action probabilities, so we can write (3). The product in (3) then telescopes, using that transitions are deterministic, so $\mathrm{Traj}(\pi) = D$ as desired.
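The whole construction can be checked end to end on a toy example. The sketch below uses a hypothetical two-step token MDP with a made-up reference policy: it reweights the reference trajectory distribution by a factor depending only on the return, reads off a policy as above, and confirms both that this policy induces exactly the reweighted distribution and that the trajectory-level KL divergence equals the KL divergence between the return distributions.

```python
# Toy check of the Theorem 2 construction (hypothetical 2-step token MDP; illustration only).
import itertools
from collections import defaultdict
import math

ACTIONS = ["a", "b"]
HORIZON = 2

def pi_ref(state):                          # reference policy: action probs given the prefix
    if state == ():     return {"a": 0.6, "b": 0.4}
    if state == ("a",): return {"a": 0.7, "b": 0.3}
    if state == ("b",): return {"a": 0.5, "b": 0.5}
    raise ValueError(state)

def g(traj):                                # mean return; here it depends only on the final state
    return float(traj.count("a"))

ref = {}
for traj in itertools.product(ACTIONS, repeat=HORIZON):
    p = 1.0
    for i in range(HORIZON):
        p *= pi_ref(traj[:i])[traj[i]]
    ref[traj] = p

t0, s = 1.5, 0.6                            # threshold and target tail mass (arbitrary demo values)
mass_hi = sum(p for traj, p in ref.items() if g(traj) > t0)
D = {traj: p * (s / mass_hi if g(traj) > t0 else (1 - s) / (1 - mass_hi))
     for traj, p in ref.items()}

visit, take = defaultdict(float), defaultdict(float)
for traj, p in D.items():
    for i in range(HORIZON):
        visit[traj[:i]] += p
        take[(traj[:i], traj[i])] += p

def pi(state):                              # the policy read off from D
    return {a: take[(state, a)] / visit[state] for a in ACTIONS}

# (i) the induced trajectory distribution is exactly D
for traj in ref:
    p = 1.0
    for i in range(HORIZON):
        p *= pi(traj[:i])[traj[i]]
    assert abs(p - D[traj]) < 1e-12

# (ii) trajectory-level KL equals the KL between the return distributions
kl_traj = sum(p * math.log(p / ref[traj]) for traj, p in D.items())
ret_D, ret_ref = defaultdict(float), defaultdict(float)
for traj in ref:
    ret_D[g(traj)] += D[traj]
    ret_ref[g(traj)] += ref[traj]
kl_ret = sum(p * math.log(p / ret_ref[r]) for r, p in ret_D.items())
print(f"KL over trajectories = {kl_traj:.6f}   KL over returns = {kl_ret:.6f}")
```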

Lagrange multipliers to minimize KL divergence

Theorem 3: If $V$ is light-tailed under the reference distribution $Q$, $\mathbb{E}_Q[V]$ is finite, and $D_{\mathrm{KL}}(P \,\|\, Q)$ is bounded, then $\mathbb{E}_P[V]$ is bounded (uniformly over all such $P$), and $\mathbb{E}_P[V] \to \mathbb{E}_Q[V]$ as $D_{\mathrm{KL}}(P \,\|\, Q) \to 0$.

Using Lagrange multipliers, we find that when KL divergence is minimized subject to a constraint on $\mathbb{E}_P[V]$, we have $\log\frac{p(v)}{q(v)} = c + \lambda v$ for some constants $c, \lambda$, so

$$p(v) = \frac{q(v)\, e^{\lambda v}}{Z}, \qquad Z = \mathbb{E}_{V \sim Q}\big[e^{\lambda V}\big].$$

That is, the new PDF is an exponential tilting of the old PDF. Now what is the normalizing constant $Z$? It's just $\mathbb{E}_{V \sim Q}[e^{\lambda V}]$, the moment-generating function of $V$ evaluated at $\lambda$. If the distribution of $V$ is heavy-tailed, this is infinite for every $\lambda > 0$; if it is light-tailed, this is some finite value for all sufficiently small $\lambda > 0$.

When $\lambda = 0$, $P$ and $Q$ are identical and $\mathbb{E}_P[V] = \mathbb{E}_Q[V]$, with $D_{\mathrm{KL}}(P \,\|\, Q) = 0$. So by a continuity argument, $\mathbb{E}_P[V] \to \mathbb{E}_Q[V]$ as $D_{\mathrm{KL}}(P \,\|\, Q) \to 0$.
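A numerical sketch of the two cases, taking a standard normal as the light-tailed example and a standard lognormal as the heavy-tailed one (both choices are just for illustration):

```python
# Sketch of the exponential-tilting picture under assumed example distributions.
# Light-tailed: V ~ N(0,1). The tilt q(v)e^{lam v}/Z is N(lam,1), so the normalizer Z, the
# tilted mean, and the KL divergence are finite and shrink to the base values as lam -> 0.
import numpy as np
from scipy.stats import lognorm

print("light-tailed V ~ N(0,1):")
for lam in [1.0, 0.3, 0.1, 0.01]:
    Z = np.exp(lam ** 2 / 2)       # E_Q[e^{lam V}]
    print(f"  lam={lam:5.2f}  Z={Z:6.4f}  tilted mean={lam:5.2f}  KL={lam ** 2 / 2:.4f}")

# Heavy-tailed: V ~ lognormal. The would-be normalizer E_Q[e^{lam V}] is infinite for every
# lam > 0: the integrand e^{lam v} q(v) eventually grows instead of decaying, so no
# exponential tilt of Q exists.
print("heavy-tailed V ~ lognormal, integrand of E_Q[e^{0.1 V}]:")
for v in [10.0, 100.0, 500.0, 1000.0]:
    print(f"  v={v:6.0f}  e^(0.1 v) q(v) = {np.exp(0.1 * v) * lognorm.pdf(v, s=1.0):.3e}")
```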

Light tails + independence imply Goodhart is not catastrophic

Theorem 4: If $X = U + V$ with $U$ and $V$ independent and both light-tailed, and the distribution of $U$ is continuous, then, writing $P_\lambda \propto Q\, e^{\lambda X}$ for the KL-regularized optimum from Theorem 3, $\mathbb{E}_{P_\lambda}[U]$ increases to $\sup \operatorname{supp}(U)$ as $\lambda \to \infty$. In particular, if $U$ is unbounded above, then unbounded optimization of the proxy $X$ also drives $\mathbb{E}[U]$ to infinity.

Proof: Fix some $\lambda > 0$ and consider $P_\lambda$. Using Lagrange multipliers as in Theorem 3, we find that for any event $A$, $P_\lambda(A) = \mathbb{E}_Q[\mathbf{1}_A\, e^{\lambda X}] / \mathbb{E}_Q[e^{\lambda X}]$; since $X = U + V$ with $U$ and $V$ independent, the factors involving $V$ cancel for events defined in terms of $U$ alone, so the marginal of $U$ under $P_\lambda$ is the exponential tilt of its marginal under $Q$ by $e^{\lambda U}$. Let $m$ be the median value of $U$ under the policy $\pi_{\mathrm{ref}}$; that is, $\Pr_{\pi_{\mathrm{ref}}}(U \ge m) = 1/2$. This exists because $U$ has a continuous distribution. Then:

$$\mathbb{E}_{P_\lambda}[U] = \mathbb{E}_{P_\lambda}\big[U\,\mathbf{1}(U \ge m)\big] + \mathbb{E}_{P_\lambda}\big[U\,\mathbf{1}(U < m)\big] \;\ge\; m\,P_\lambda(U \ge m) + \frac{\mathbb{E}_Q\big[U e^{\lambda U}\,\mathbf{1}(U < m)\big]}{\mathbb{E}_Q\big[e^{\lambda U}\big]}.$$

The left term tends to $m$ as $\lambda \to \infty$ (tilting by $e^{\lambda U}$ drives $P_\lambda(U \ge m)$ from its reference value of $1/2$ up to $1$), while the right term tends to $0$ (its magnitude is at most $e^{-\lambda(m' - m)}\, \mathbb{E}_Q[|U|] / Q(U \ge m')$ for any $m' \in (m, \sup\operatorname{supp}(U))$ with $Q(U \ge m') > 0$), so the overall limit is at least $m$. Running the same argument with $m$ replaced by any other quantile of $U$ below $\sup\operatorname{supp}(U)$ shows that $\lim_{\lambda\to\infty} \mathbb{E}_{P_\lambda}[U] = \sup\operatorname{supp}(U)$, as desired.
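A small Monte Carlo sketch of this behavior, under the assumption that $U$ and $V$ are independent standard normals (so $X = U + V$, the reference median of $U$ is $m = 0$, and the tilted marginal of $U$ is $N(\lambda, 1)$):

```python
# Monte Carlo sketch: U, V independent standard normals (assumed example), proxy X = U + V.
# Tilting the joint distribution by e^{lam X} should tilt the U-marginal by e^{lam U} alone,
# so E_P[U] = lam and P(U >= m) = Phi(lam) -> 1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 2_000_000
U = rng.standard_normal(n)
V = rng.standard_normal(n)
m = 0.0                                    # median of U under the reference distribution

for lam in [0.0, 0.5, 1.0, 1.5, 2.0]:
    w = np.exp(lam * (U + V))              # unnormalized tilt by the proxy X = U + V
    w /= w.sum()
    e_u = float(np.sum(w * U))
    p_hi = float(np.sum(w * (U >= m)))
    print(f"lam={lam:3.1f}  E_P[U]={e_u:6.3f} (theory {lam:3.1f})  "
          f"P(U>=m)={p_hi:5.3f} (theory {norm.cdf(lam):5.3f})")
```

For larger $\lambda$ the importance weights become too skewed for a plain Monte Carlo estimate to be reliable, which is why the sweep stops at $\lambda = 2$.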
