It seems to me that the main goal of quantilization is to reduce the extreme unintended outcomes of maximizing (by sampling from something like a human-learned distribution over actions) while still remaining competitive (by sampling from only the upper quantile of said distribution).
But that still leaves open the possibility of sampling one of those super highly maximized actions! Why not just hard-cut off the upper portion of the distribution as well? Why not have a quantilizer that samples between the 90%ile and the 99%ile?
Or, while we’re at it, why are we sampling at all? Why not just say that we’re taking the 99%ile action?
Quantilizing can be thought of as maximizing a lower bound on the expected true utility, where you know that your true utility V is close to your proxy utility function U in some region γ, such that E_{a∼γ}[|U(a) − V(a)|] ≤ ϵ. If we shape this closeness assumption a bit differently, such that the approximation gets worse faster, then sometimes it can be optimal to cut off the top of the distribution (as I did here; see some of the diagrams for quantilizers with the top cut off, one of which I’ll paste below).
The reason normal quantilizers don’t do that is that they are minimizing the distance between γ and the action distribution, by a particular measure that falls out of the proof (see above link), which allows the lower bound to be as high as possible. Essentially it’s minimizing distribution shift, which allows a better generalization bound.
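To make the lower bound concrete, here is a sketch of the standard argument. It assumes the measure in question is the worst-case density ratio between the action distribution p and γ, and it writes q for the quantile fraction (a symbol not used above). The linked proof may be phrased differently.

```latex
% Assumption: p is any action distribution with p(a) <= gamma(a) / q for all a.
\begin{align*}
\mathbb{E}_{a\sim p}\,|U(a)-V(a)|
  &= \sum_a p(a)\,|U(a)-V(a)|
   \le \frac{1}{q}\sum_a \gamma(a)\,|U(a)-V(a)|
   \le \frac{\epsilon}{q}, \\
\mathbb{E}_{a\sim p}[V(a)]
  &\ge \mathbb{E}_{a\sim p}[U(a)] - \frac{\epsilon}{q}.
\end{align*}
% Among all p with density ratio at most 1/q, the one that maximizes E_p[U]
% puts mass gamma(a)/q on the top q fraction of gamma ranked by U,
% which is the ordinary q-quantilizer.
```

Under this reading, cutting off the top of the band while keeping the same ratio cap can only lower the E_p[U] term, so it only pays off if the closeness assumption itself changes.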
I think this distribution shift perspective is one way of explaining why we need randomization at all. A delta function is a bigger distribution shift than a distribution that matches the shape of γ.
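A minimal sketch of the variants being compared. It assumes γ is represented by a finite list of sampled actions, and `proxy_utility`, `quantilize`, and `max_density_ratio` are hypothetical names introduced only for illustration.

```python
import random

def quantilize(actions, proxy_utility, lo=0.9, hi=1.0, rng=random):
    """Sample uniformly from the actions ranked in the [lo, hi) quantile band.

    `actions` is assumed to be a list of samples from the base distribution γ.
    lo=0.9, hi=1.0 gives an ordinary top-10% quantilizer;
    lo=0.9, hi=0.99 additionally cuts off the top 1%.
    """
    ranked = sorted(actions, key=proxy_utility)
    n = len(ranked)
    band = ranked[round(lo * n):round(hi * n)]
    if not band:
        raise ValueError("quantile band contains no actions")
    return rng.choice(band)

def max_density_ratio(lo, hi):
    """Worst-case ratio between the band-uniform distribution and γ.

    This is one possible measure of distribution shift (an assumption here):
    the narrower the band, the larger the ratio.
    """
    return 1.0 / (hi - lo)
```

With this measure, the top-10% quantilizer has ratio 10 and the 90%ile to 99%ile band has ratio about 11.1. Deterministically picking a single action out of n samples corresponds to a band of width 1/n, so the ratio is n and grows without bound as the sample gets larger.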
But the next question is why are we even in a situation where we need to deal with the worst case across possible true utility functions? One story is that we are dealing with an optimizer that is maximizing true utility + error, and one way to simplify that is to model it as max min (true utility − error), where the min only controls the error function within the restrictions of the known bound.
I’m not currently super happy with that story and I’m keen for people to look for alternatives, or variations of soft optimization with different types of knowledge about the relationship between the proxy and true utility. Because intuitively it does seem like taking the 99%ile action should be fine under slightly different assumptions.
One example of this is if we know that U=V+e, where e is some heavy-tailed noise, and we know the distributions of e and V. Then we can calculate the actual optimal percentile action to take, and we should deterministically take that action. But this is sometimes quite sensitive to small errors in our knowledge about the distribution of e and particularly V. My AISC team has been testing scenarios like this as part of their research.
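A toy simulation of the U=V+e setup, as a sketch only. The specific distributions are assumptions chosen for illustration (V standard normal, e Student-t with 2 degrees of freedom), not the scenarios the team is testing, and `mean_true_utility_by_percentile` is a hypothetical helper name.

```python
import math
import random

def mean_true_utility_by_percentile(n_samples=200_000, seed=0):
    """Estimate E[V | U lands in the k-th percentile bin] for k = 0..99.

    Assumed toy model: V ~ Normal(0, 1) and e ~ Student-t with 2 degrees
    of freedom (heavy-tailed), with proxy U = V + e.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        v = rng.gauss(0.0, 1.0)
        # Student-t(2) = Z / sqrt(chi2_2 / 2), and chi2_2 / 2 ~ Exponential(1).
        e = rng.gauss(0.0, 1.0) / math.sqrt(rng.expovariate(1.0))
        samples.append((v + e, v))
    samples.sort()  # sort by proxy utility U
    bin_size = n_samples // 100
    means = []
    for k in range(100):
        chunk = samples[k * bin_size:(k + 1) * bin_size]
        means.append(sum(v for _, v in chunk) / len(chunk))
    return means

means = mean_true_utility_by_percentile()
best_percentile = max(range(100), key=means.__getitem__)
```

In this toy model the best percentile bin sits below the top one, because the very highest proxy scores are mostly explained by large noise rather than large V. Changing the assumed tail of e or the spread of V moves `best_percentile`, which is one way to see the sensitivity mentioned above.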
Yes, this is fine to do, and prevents single-shot problems if you have a particular picture of the distribution over outcomes where most disastrous risk comes from edge cases that get 99.99%ile score but are actually bad, and all we need is actions that are 99th percentile.
This is fine if you want your AI to stack blocks on top of other blocks or something.
But unfortunately when you want to use a quantilizer to do something outside the normal human distribution, like cure cancer or supervise the training of a superhuman AI, you’re no longer just shooting for a 99%ile policy. You want the AI to do something unlikely, and so of course, you can’t simultaneously restrict it to only do likely things.
Now, you can construct a policy where each individual action is exactly 99%ile on some distribution of human actions. The results quickly go off-distribution and end up somewhere new—chaining together 100 99%ile actions is not the same as picking a 99%ile policy. This means you don’t get the straightforward quantilizer-style guarantee, but maybe it’s not all bad—after all, we wanted to go off distribution to cure cancer, and maybe by taking actions with some kind of moderation we can get some other kind of benefits we don’t quite understand yet. I’m reminded of Peli Grietzer’s recent post on virtue ethics.
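A toy calculation of the gap between chaining 99%ile actions and picking a 99%ile policy. It assumes, purely for illustration, that each of the 100 steps has an independent standard-normal score and that a policy's score is the sum of its step scores.

```python
from statistics import NormalDist

N_STEPS = 100

# Assumed toy model: each step's score is an independent Normal(0, 1) draw.
step = NormalDist(mu=0.0, sigma=1.0)
z99 = step.inv_cdf(0.99)  # 99th percentile of a single step, about 2.326

# Chaining: take the 99th-percentile action at every one of the 100 steps.
chained_total = N_STEPS * z99

# Whole-policy view: the sum of 100 independent steps is Normal(0, sqrt(100)).
policy = NormalDist(mu=0.0, sigma=N_STEPS ** 0.5)
policy_p99 = policy.inv_cdf(0.99)

# How far out the chained policy sits, in policy-level standard deviations.
chained_sigmas = chained_total / policy.stdev
```

Under these assumptions the chained total is ten times the 99%ile policy score, putting it roughly 23 standard deviations out in the distribution over whole policies, which is far off-distribution even though every individual step looks moderate.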