It does leave one question — how do we make a list of possible actions in the first place?
I mean, the way you’d implement a quantilizer nowadays looks like: train a policy (e.g. an LLM, or a human imitator) that you think is safe. Then you can estimate what an Xth-percentile value is via sampling (and some sort of reward model), and then perform rejection sampling to output actions whose value is greater than that of the Xth-percentile action.
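A minimal sketch of that recipe, where `sample_action` and `reward_model` are hypothetical stand-ins for whatever base policy (LLM, human imitator) and value estimator you actually have:

```python
import random

def quantilize(sample_action, reward_model, q=0.1,
               n_calibration=1_000, max_tries=10_000):
    """Sample an action whose estimated value clears the top-q threshold
    of the base policy, via sampling plus rejection sampling.

    sample_action: () -> action, one draw from the safe base policy
                   (e.g. a completion from an LLM or human imitator)
    reward_model:  action -> float, some learned estimate of value
    """
    # Estimate the (1 - q) quantile of reward under the base policy.
    scores = sorted(reward_model(sample_action()) for _ in range(n_calibration))
    threshold = scores[min(int((1 - q) * n_calibration), n_calibration - 1)]

    # Rejection-sample from the base policy until an action clears the threshold.
    for _ in range(max_tries):
        action = sample_action()
        if reward_model(action) >= threshold:
            return action
    raise RuntimeError("no sampled action cleared the estimated threshold")

# Toy usage: "actions" are draws from N(0, 1) and the reward model is the identity.
if __name__ == "__main__":
    print(quantilize(lambda: random.gauss(0.0, 1.0), lambda a: a, q=0.05))
```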
A simpler way to implement something that’s almost as good is to sample N actions and then take the best of those N. (You can also do things like sample randomly from the top X percentile of the N actions.)
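A sketch of those two variants, with the same hypothetical `sample_action` / `reward_model` callables as above:

```python
import random

def best_of_n(sample_action, reward_model, n=16):
    """Draw n actions from the base policy and return the highest-scoring one."""
    return max((sample_action() for _ in range(n)), key=reward_model)

def sample_from_top_quantile(sample_action, reward_model, n=64, q=0.1):
    """Draw n actions and pick uniformly at random from the top-q fraction,
    which is closer in spirit to a quantilizer than plain best-of-n."""
    ranked = sorted((sample_action() for _ in range(n)),
                    key=reward_model, reverse=True)
    top_k = max(1, int(q * n))
    return random.choice(ranked[:top_k])
```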
Quantilizers are a proposed safer approach to AI goals. By randomly choosing from among the top options rather than always taking the single best one, they avoid extreme behaviors that could cause harm. More research is needed, but quantilizers show promise as a model for building AI systems that are beneficial but limited in scope.
I think the important part of the quantilizer work was not the idea that you should regularize policies to be closer to some safe policy (in fact, modern RLHF has a term in its reward that encourages minimizing the KL divergence between the current policy and the base policy). Instead, I think the important part is Theorem 2 (the Quantilizer optimality theorem). In English, it says something like:
If you don’t know anything about the true cost function except that a base policy gets bounded expected cost wrt it, then you can’t do better than quantilization for optimizing reward subject to a constraint on worst-case expected cost.
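Slightly more formally (this is my own paraphrase of the setup, so the exact statement in the paper may differ):

```latex
% Setup: base policy \gamma over actions, utility U, and an unknown cost
% C \ge 0 about which we only know \mathbb{E}_{a \sim \gamma}[C(a)] \le c.
% The q-quantilizer Q_q samples from \gamma conditioned on U(a) landing in
% the top q fraction under \gamma, so Q_q(a) \le \gamma(a)/q pointwise and
\[
  \mathbb{E}_{a \sim Q_q}[C(a)]
  \;\le\; \frac{1}{q}\,\mathbb{E}_{a \sim \gamma}[C(a)]
  \;\le\; \frac{c}{q}
  \qquad \text{for every admissible } C.
\]
% The optimality claim is then that no distribution over actions with the
% same worst-case guarantee (expected cost at most c/q against every such C)
% achieves higher expected utility than Q_q.
```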
So if you end up in the situation where it’s easy to specify a knowably safe policy, but hard to specify any information about the cost whatsoever (except that cost is always non-negative), you might as well implement something like quantilization to be safe.
Note that BoN and (perfect[1]) RL with KL constraints also satisfy other optimality criteria that can be framed similarly.
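For concreteness, the kind of criteria I mean (standard results as I understand them, stated loosely): a perfect optimizer of the KL-regularized objective recovers an exponentially tilted version of the base policy, and BoN’s KL from the base policy has a simple upper bound.

```latex
% KL-regularized RL: maximizing \mathbb{E}_\pi[r(a)] - \beta\,\mathrm{KL}(\pi \| \pi_0)
% is solved exactly by the tilted base policy
\[
  \pi^\ast(a) \;\propto\; \pi_0(a)\,\exp\!\big(r(a)/\beta\big),
\]
% so a perfect optimizer of that objective sits on the reward-vs-KL frontier.
% Best-of-N over i.i.d. samples from \pi_0 satisfies the standard bound
\[
  \mathrm{KL}\big(\pi_{\mathrm{BoN}} \,\|\, \pi_0\big) \;\le\; \log N - \frac{N-1}{N}.
\]
```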
In practice, existing RL algorithms like PPO seem to have difficulty reaching the Pareto frontier of Reward vs KL, so they don’t satisfy the corresponding optimality criterion.