Standard AI systems are optimizers: they ‘look’ through the possible actions they could take and pick the one that maximizes what they care about. This can be dangerous: an AI that maximizes in this way needs to care about exactly the same things that humans care about, which is really hard[1]. If you tell a human to calculate as many digits of pi as possible within a year, they’ll do ‘reasonable’ things towards that goal. An optimizing AI might work out that it could calculate many more digits in a year by taking over another supercomputer; since this is the most effective action, it looks very attractive to the AI.
Quantilizers are a different approach. Instead of maximizing, they choose randomly from among the most effective possible actions. They work like this:
Start with a goal, and a set of possible actions
Predict how useful each action will be for achieving the goal
Rank the actions from the most to the least useful
Pick randomly from the top fraction only (e.g., the top 10%)
This avoids cases where the AI chooses extreme actions to maximize the goal; instead, it chooses somewhat helpful actions.
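The four steps above can be sketched in a few lines of code. This is a minimal toy version: the action list, utility estimates, and function names are illustrative, not part of any real proposal.

```python
import random

def quantilize(actions, estimate_utility, q=0.1, rng=random):
    """Pick uniformly at random from the top q fraction of actions,
    ranked by estimated utility (a toy quantilizer)."""
    # Rank actions from most to least useful for the goal
    ranked = sorted(actions, key=estimate_utility, reverse=True)
    # Keep only the top q fraction (at least one action)
    cutoff = max(1, int(len(ranked) * q))
    # Choose randomly among them instead of always taking the best
    return rng.choice(ranked[:cutoff])
```

For example, with 100 candidate actions and `q=0.1`, the quantilizer returns one of the ten highest-utility actions, chosen at random rather than always the single maximizer.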
It does leave one question — how do we make a list of possible actions in the first place? One suggestion is to ask a lot of humans to solve the task and train an AI to generate possible things it thinks humans would do. This list can then be used as an input to our quantilizer.
This does make them less effective, of course: firstly by picking less effective actions overall, and secondly by restricting themselves to actions they think humans would take. But this might be worth the reduced risk. Indeed, depending on your risk tolerance, you can change the fraction of top actions the quantilizer will consider, making it more effective but riskier, or vice versa.
So quantilizers trade some capability for greater safety, avoiding unintended consequences. They pick from many mild actions and very few extreme actions, so the chance of them doing something extreme or unexpected is minuscule.
Quantilizers are a proposed safer approach to AI goals. By randomly choosing from a selection of the top options, they avoid extreme behaviors that could cause harm. They provide an alternative to goal maximization, which can be dangerous, though they are purely theoretical right now. More research is needed, but quantilizers show promise as a model for building AI systems that are beneficial but limited in scope.
[1] Humans care about an awful lot of different things; even a single human does!
In my other aligning-a-human-level-intelligence project (parenting), my kids get “points” for trying new foods. We often have arguments about what kinds of trivial modifications to an old food will make it count as a new food. This seems like it could be a similar problem: couldn’t a superintelligence generate thousands of non-substantive variations of an effective, dangerous action while electing not to do so for other actions?
Similarly, since the tails come apart, perhaps it would be better to sample from 85-95%ile actions instead of sampling from 90-100%ile actions.
I mean, the way you’d implement a quantilizer nowadays looks like this: train a policy (e.g. an LLM, or a human imitator) that you think is safe. Then you can estimate the Xth-percentile value via sampling (and some sort of reward model), and then perform rejection sampling to output actions that have value greater than the Xth-percentile action.
A simpler way to implement a thing that’s almost as good is to sample N actions, and then take the best of those N actions. (You can also do things like, sample randomly from the top X percentile of the N actions.)
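Both ideas can be sketched in code. Here is a toy version, using a stand-in “base policy” that just samples random numbers and a reward equal to the number itself; all names and parameters are illustrative, not from any production RLHF setup.

```python
import random

def best_of_n(sample_action, reward, n=16):
    """Best-of-N: draw n actions from the base policy, return the highest-reward one."""
    return max((sample_action() for _ in range(n)), key=reward)

def rejection_quantilize(sample_action, reward, q=0.1, n_estimate=1000):
    """Approximate quantilization via rejection sampling: estimate the
    (1-q)th percentile reward from samples, then keep sampling until an
    action clears that threshold."""
    samples = sorted(reward(sample_action()) for _ in range(n_estimate))
    idx = max(0, int((1 - q) * n_estimate) - 1)
    threshold = samples[idx]
    while True:
        a = sample_action()
        if reward(a) >= threshold:
            return a
```

Note that `best_of_n` is deterministic given its samples (it always takes the max), while `rejection_quantilize` returns a random draw from roughly the top `q` fraction of the base policy’s action distribution.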
I think the important part of the quantilizer work was not the idea that you should regularize policies to be closer to some safe policy (in fact, modern RLHF has a term in its reward that encourages minimizing the KL divergence between the current policy and the base policy). Instead, I think the important part was Theorem 2 (the quantilizer optimality theorem). In English, it says something like:
If you don’t know anything about the true cost function except that a base policy gets bounded expected loss wrt it, then you can’t do better than quantilization for optimizing reward subject to a constraint on worst-case expected cost.
So if you end up in the situation where it’s easy to specify a knowably safe policy, but hard to specify any information about the cost whatsoever (except that cost is always non-negative), you might as well implement something like quantilization to be safe.
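For reference, the cost bound underlying this guarantee can be written as follows. This is my paraphrase, not the paper’s exact statement, assuming a q-quantilizer \(Q_q\) that samples uniformly from the top \(q\) fraction of a base distribution \(\gamma\), with an unknown cost function \(C \ge 0\):

```latex
% The quantilizer's density is at most 1/q times the base policy's density:
Q_q(a) \le \frac{\gamma(a)}{q} \quad \text{for all actions } a,
% so its expected cost is at most 1/q times the base policy's expected cost:
\mathbb{E}_{a \sim Q_q}[C(a)] \le \frac{1}{q}\,\mathbb{E}_{a \sim \gamma}[C(a)].
```

So if the base policy’s expected cost is bounded, the quantilizer’s worst-case expected cost is bounded too, degrading gracefully as \(q\) shrinks.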
Note that BoN and (perfect[1]) RL with KL constraints also satisfy other optimality criteria that can be framed similarly.
In practice, existing RL algorithms like PPO seem to have difficulty reaching the Pareto frontier of Reward vs KL, so they don’t satisfy the corresponding optimality criterion.
Thanks for this. Is there some reasoning about why “extreme” outcomes are minimized by this randomness (presumably some reason to expect that effectiveness is correlated with extremity)? And why randomize, rather than always picking the 95th-percentile option?
Honestly this kinda feels like what LLM agents do, with the exception that LLM agents have been trained on a vast corpus including lots of fiction, so their notion of “actions humans would take” tends to be fairly skewed on some particular topics.
Sutskever has an interesting theory on this that you might want to watch. He calls it the X and Y data compression theory.