Another view of quantilizers: avoiding Goodhart’s Law
Goodhart’s law states:
Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
One way of framing this is that, when you are solving some optimization problem, a metric that is correlated with a desired objective will often stop being correlated with the objective when you look at the extreme values of the metric. For example, although the number of paperclips a paperclip factory produces tends to be correlated with how useful the factory is for its owner’s values, a paperclip factory that produces an extremely high number of paperclips is likely to be quite bad for its owner’s values.
Let’s try to formalize this. Suppose you are finding some action $a \in \mathcal{A}$ that optimizes some unknown objective function $V : \mathcal{A} \to \mathbb{R}$, and you have some estimate $U : \mathcal{A} \to \mathbb{R}$ which you believe to approximate $V$. Specifically, you have a guarantee that, for some base distribution $\gamma$ over $\mathcal{A}$, $U$ does not incorrectly estimate $V$ much on average:

$$\mathbb{E}_{a \sim \gamma}\big[\,|U(a) - V(a)|\,\big] \leq k$$
We might suppose that we only want to take actions if our expected $V$ is above zero; otherwise, it would be better to do nothing.
Given this, how do you pick an $a$ to guarantee a good objective value across all possible objective functions $V$? Naively, you might pick $a^* = \operatorname{argmax}_a U(a)$; however, if this $a^*$ has a low probability under $\gamma$, then it is possible for $U(a^*)$ to be much higher than $V(a^*)$ without causing $U$ to incorrectly estimate $V$ much on average.
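To see how badly this can fail, consider a toy case (numbers chosen here for illustration, not taken from the original argument): let $\gamma$ be uniform over $N = 10^6$ actions and $k = 1$. An adversary can set $V(a) = U(a)$ everywhere except at $a^* = \operatorname{argmax}_a U(a)$, where it sets $V(a^*) = U(a^*) - 10^6$; this single error contributes only $10^6 / N = 1 \leq k$ to the average, yet the naive maximizer’s true value falls $10^6$ below its estimate. The gap grows without bound as $a^*$ becomes rarer under $\gamma$.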
If $V$ is chosen adversarially, the optimization problem to solve is:

$$\operatorname*{argmax}_{p \in [0,1],\ \pi} \; \min_{V \,:\; \mathbb{E}_{a \sim \gamma}[|U(a) - V(a)|] \leq k} \; p \cdot \mathbb{E}_{a \sim \pi}[V(a)]$$

where $p$ is the probability that the agent takes an action at all, and $\pi$ is the action distribution if it takes an action. Equivalently, since the most adversarial values of $V$ will not ever be above $U$:

$$\operatorname*{argmax}_{p \in [0,1],\ \pi} \; \min_{V \leq U \,:\; \mathbb{E}_{a \sim \gamma}[U(a) - V(a)] \leq k} \; p \cdot \mathbb{E}_{a \sim \pi}[V(a)]$$

Define $c := U - V$:

$$\operatorname*{argmax}_{p \in [0,1],\ \pi} \; \min_{c \geq 0 \,:\; \mathbb{E}_{a \sim \gamma}[c(a)] \leq k} \; p \cdot \mathbb{E}_{a \sim \pi}[U(a) - c(a)]$$
In fact, when $p = 1$, the solution to this optimization problem is a $q$-quantilizer with utility function $U$ and base distribution $\gamma$, for some $q \in (0, 1]$. The proof can be found in the “Optimality of quantilizers under the cost constraint” section of the post about quantilizers. $p$ will be set to 1 if and only if this quantilizer is guaranteed positive utility.
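For concreteness, here is a minimal sampling-based sketch of a $q$-quantilizer in Python. The function and parameter names (`sample_base_action`, `U`, `q`, `n_samples`) are illustrative assumptions, not from the original posts, and the empirical top-$q$ cutoff only approximates the exact quantilizer distribution:

```python
import random

def quantilize(sample_base_action, U, q, n_samples=10_000):
    """Approximately sample from a q-quantilizer.

    sample_base_action: draws one action from the base distribution gamma
    U: estimated utility function (a proxy for the true objective V)
    q: fraction of the base distribution to keep (0 < q <= 1);
       q = 1 reproduces gamma, while q -> 0 approaches argmax of U
    """
    # Draw a large sample from the base distribution gamma.
    actions = [sample_base_action() for _ in range(n_samples)]
    # Rank the sampled actions by estimated utility, best first.
    actions.sort(key=U, reverse=True)
    # Keep the top q fraction of the sample.
    top = actions[:max(1, int(q * n_samples))]
    # Within the top fraction, weight actions as gamma does; for an
    # empirical sample this means choosing uniformly among them.
    return random.choice(top)

# Toy usage: gamma is uniform over the integers 0..99, U prefers larger ones.
action = quantilize(lambda: random.randrange(100), U=lambda a: a, q=0.1)
```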
This provides another view of what quantilizers are doing. In effect, they are treating the “utility function” $U$ as an estimate of the true utility function $V$ that tends to be accurate on average across the base distribution $\gamma$, and conservatively optimizing given adversarial uncertainty about the true utility function $V$.
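The conservatism is quantitative: as shown in the original quantilizers post, a $q$-quantilizer’s density never exceeds $\gamma(a)/q$, so for any cost function $c \geq 0$ with $\mathbb{E}_{a \sim \gamma}[c(a)] \leq k$, the quantilizer’s action distribution $Q_q$ satisfies

$$\mathbb{E}_{a \sim Q_q}[c(a)] \leq \frac{1}{q}\,\mathbb{E}_{a \sim \gamma}[c(a)] \leq \frac{k}{q}, \qquad \text{hence} \qquad \mathbb{E}_{a \sim Q_q}[V(a)] \geq \mathbb{E}_{a \sim Q_q}[U(a)] - \frac{k}{q}.$$

That is, the true value obtained can fall below the estimated value by at most $k/q$ in expectation.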
Happy new year.
I have just posted on LessWrong the result of my work on AI.
Your work on quantilisers is the core mathematical evidence for what I propose (the code is another).
I would really appreciate your opinion on it.
Kind regards
Not sure how closely related it is (I have not read through it), but here is another paper trying to fight Goodhart’s law: http://arxiv.org/abs/1506.06980