It does leave one question — how do we make a list of possible actions in the first place?
I mean, the way you’d implement a quantilizer nowadays looks like: train a policy (e.g. an LLM, or a human imitator) that you think is safe. Then you can estimate what an Xth-percentile value is via sampling (and some sort of reward model), and then perform rejection sampling to output actions whose value is greater than that of the Xth-percentile action.
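A minimal sketch of that recipe, where `sample_action` and `reward_model` are hypothetical stand-ins for whatever base policy (LLM, human imitator) and value estimator you actually have:

```python
import random

def quantilize(sample_action, reward_model, q=0.1,
               n_calibration=1_000, max_tries=10_000):
    """Sample an action whose estimated value clears the top-q threshold
    of the base policy, via sampling plus rejection sampling.

    sample_action: () -> action, one draw from the safe base policy
                   (e.g. a completion from an LLM or human imitator)
    reward_model:  action -> float, some learned estimate of value
    """
    # Estimate the (1 - q) quantile of reward under the base policy.
    scores = sorted(reward_model(sample_action()) for _ in range(n_calibration))
    threshold = scores[min(int((1 - q) * n_calibration), n_calibration - 1)]

    # Rejection-sample from the base policy until an action clears the threshold.
    for _ in range(max_tries):
        action = sample_action()
        if reward_model(action) >= threshold:
            return action
    raise RuntimeError("no sampled action cleared the estimated threshold")

# Toy usage: "actions" are draws from N(0, 1) and the reward model is the identity.
if __name__ == "__main__":
    print(quantilize(lambda: random.gauss(0.0, 1.0), lambda a: a, q=0.05))
```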
A simpler way to implement something that’s almost as good is to sample N actions and then take the best of those N. (You can also do things like sample randomly from the top X percentile of the N actions.)
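A sketch of those two variants, with the same hypothetical `sample_action` / `reward_model` callables as above:

```python
import random

def best_of_n(sample_action, reward_model, n=16):
    """Draw n actions from the base policy and return the highest-scoring one."""
    return max((sample_action() for _ in range(n)), key=reward_model)

def sample_from_top_quantile(sample_action, reward_model, n=64, q=0.1):
    """Draw n actions and pick uniformly at random from the top-q fraction,
    which is closer in spirit to a quantilizer than plain best-of-n."""
    ranked = sorted((sample_action() for _ in range(n)),
                    key=reward_model, reverse=True)
    top_k = max(1, int(q * n))
    return random.choice(ranked[:top_k])
```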
Quantilizers are a proposed safer approach to AI goals. By randomly choosing from among the top options rather than always taking the single best one, they avoid extreme behaviors that could cause harm. More research is needed, but quantilizers show promise as a model for building AI systems that are beneficial but limited in scope.
I think the important part of the quantilizer work was not the idea that you should regularize policies to be closer to some safe policy (in fact, modern RLHF has a term in its reward that encourages minimizing the KL divergence between the current policy and the base policy). Instead, I think the important part is Theorem 2 (the Quantilizer optimality theorem). In English, it says something like:
If you don’t know anything about the true cost function except that a base policy gets bounded expected cost wrt it, then you can’t do better than quantilization for optimizing reward subject to a constraint on worst-case expected cost.
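Slightly more formally (this is my own paraphrase of the setup, so the exact statement in the paper may differ):

```latex
% Setup: base policy \gamma over actions, utility U, and an unknown cost
% C \ge 0 about which we only know \mathbb{E}_{a \sim \gamma}[C(a)] \le c.
% The q-quantilizer Q_q samples from \gamma conditioned on U(a) landing in
% the top q fraction under \gamma, so Q_q(a) \le \gamma(a)/q pointwise and
\[
  \mathbb{E}_{a \sim Q_q}[C(a)]
  \;\le\; \frac{1}{q}\,\mathbb{E}_{a \sim \gamma}[C(a)]
  \;\le\; \frac{c}{q}
  \qquad \text{for every admissible } C.
\]
% The optimality claim is then that no distribution over actions with the
% same worst-case guarantee (expected cost at most c/q against every such C)
% achieves higher expected utility than Q_q.
```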
So if you end up in the situation where it’s easy to specify a knowably safe policy, but hard to specify any information about the cost whatsoever (except that cost is always non-negative), you might as well implement something like quantilization to be safe.
Note that BoN and (perfect[1]) RL with KL constraints also satisfy other optimality criteria that can be framed similarly.
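For concreteness, the kind of criteria I mean (standard results as I understand them, stated loosely): a perfect optimizer of the KL-regularized objective recovers an exponentially tilted version of the base policy, and BoN’s KL from the base policy has a simple upper bound.

```latex
% KL-regularized RL: maximizing \mathbb{E}_\pi[r(a)] - \beta\,\mathrm{KL}(\pi \| \pi_0)
% is solved exactly by the tilted base policy
\[
  \pi^\ast(a) \;\propto\; \pi_0(a)\,\exp\!\big(r(a)/\beta\big),
\]
% so a perfect optimizer of that objective sits on the reward-vs-KL frontier.
% Best-of-N over i.i.d. samples from \pi_0 satisfies the standard bound
\[
  \mathrm{KL}\big(\pi_{\mathrm{BoN}} \,\|\, \pi_0\big) \;\le\; \log N - \frac{N-1}{N}.
\]
```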
In practice, existing RL algorithms like PPO seem to have difficulty reaching the Pareto frontier of Reward vs KL, so they don’t satisfy the corresponding optimality criterion.