Epistemic status: most elements are not new, but the synthesis seems useful.
Here is an alignment protocol that I call “autocalibrated quantilized debate” (AQD).
Arguably the biggest concern with naive debate[1] is that a superintelligent AI might be able to attack a human brain in a manner that takes it out of the regime of quasi-rational reasoning altogether, in which case the framing of “arguments and counterarguments” no longer makes sense. Let’s call utterances that have this property “Lovecraftian”. To counter this, I suggest using quantilization. Quantilization postulates that the true utility function is the difference between the proxy utility function and an unknown adversarial cost function whose expected value w.r.t. some base distribution is bounded. For debate, we can imagine the cost function as measuring the extent to which a particular utterance is Lovecraftian (a minimal formalization of the bound follows the list below). Quantilization also protects us from Cartesian daemons. There are several candidates for the base distribution:
Uniform
Some sort of bounded simplicity prior. This amounts to postulating that a Lovecraftian utterance is unlikely to be produced by a randomized algorithm with low description complexity and low computational complexity.
The distribution of utterances produced by human debaters, acquired via imitation learning (this requires hardening[2] against daemons)
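To make “bounded expected value w.r.t. some base distribution” concrete, here is the standard quantilizer bound; the notation ($q$, $\mu$, $c$, $C$) is mine, not from the post. If $Q_q$ is any policy whose density nowhere exceeds $1/q$ times the density of the base distribution $\mu$ (for instance, sampling from $\mu$ conditioned on the top $q$-fraction by proxy utility), and the adversarial cost $c \ge 0$ satisfies $\mathbb{E}_{x \sim \mu}[c(x)] \le C$, then

$$\mathbb{E}_{x \sim Q_q}[c(x)] \;\le\; \frac{1}{q}\,\mathbb{E}_{x \sim \mu}[c(x)] \;\le\; \frac{C}{q}.$$

Aggressive quantilization corresponds to $q$ close to 1 (little optimization, tight cost bound); lowering $q$ strengthens optimization while loosening the bound on Lovecraftian-ness, which is exactly the trade-off the autocalibration procedure below has to navigate.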
The issue with this is that it might be hard to choose the quantilization parameter. An error in one direction produces a system that is too weak for pivotal acts; an error in the other direction produces an unsafe system. However, it seems likely that there is some range of parameters which is okay: arguably, the optimization power needed to produce a natural-language description of a pivotal act[3] is not as large as the optimization power required for Lovecraftian utterances. Indeed, it seems unlikely we could design Lovecraftian utterances ourselves in the foreseeable future, so if solving alignment is not easier than that, it’s hopeless anyway.
To address this, I propose the following “autocalibration” procedure. Let α1, α2, … be a sequence of quantilization parameters, starting from extremely aggressive (definitely safe) and becoming gradually less aggressive (stronger optimization, weaker safety), such that we can be reasonably sure it doesn’t skip over the target range. For each i, we run an αi-quantilized debate about the question “is replacing αi-quantilized debate by αi+1-quantilized debate a good idea?”. We can then execute this sequence of debates until the point where the answer switches from “yes” to “no”.
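A minimal sketch of this loop in Python; `run_quantilized_debate` is a hypothetical stand-in for the whole debate machinery (debaters, judge, quantilized sampling), and the parameter schedule is whatever satisfies the “doesn’t skip the target range” condition above:

```python
def autocalibrate(alphas, run_quantilized_debate):
    """Walk a schedule of quantilization parameters, most aggressive first.

    alphas: sequence alpha_1, alpha_2, ... ordered from definitely-safe to
        progressively stronger optimization.
    run_quantilized_debate(alpha, question) -> bool: runs an alpha-quantilized
        debate and returns the judge's verdict (True = "yes").
    Returns the last parameter whose debate did not endorse moving further.
    """
    for i in range(len(alphas) - 1):
        question = (f"Is replacing {alphas[i]}-quantilized debate "
                    f"by {alphas[i + 1]}-quantilized debate a good idea?")
        if not run_quantilized_debate(alphas[i], question):
            return alphas[i]   # answer switched to "no": stop here
    return alphas[-1]          # the entire schedule was endorsed
```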
Now, this sequence might be impractically time-consuming. To address this, we can introduce a prediction system which predicts the outcomes of debates without having to actually run them. Notice that this is more viable than e.g. amplifying human imitation via prediction, because here we are predicting a low-dimensional signal, whereas with imitation we have a high-dimensional signal with a much lower-dimensional but unknown subspace of relevant features. Of course this system has to be hardened[2] against daemons. Once the optimal quantilization parameter has been determined, we can use it for a debate about the pivotal act.
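One way the prediction system could slot into the loop above; this is my own sketch, not a specification from the post. The hypothetical `predictor` returns a verdict plus a confidence, and we fall back to an actual debate when the confidence is too low (the threshold value here is arbitrary):

```python
def verdict_with_predictor(alpha, question, predictor, run_quantilized_debate,
                           confidence_threshold=0.95):
    """Use the (daemon-hardened) debate-outcome predictor when it is confident.

    predictor(alpha, question) -> (verdict: bool, confidence: float in [0, 1]).
    Otherwise fall back to actually running the quantilized debate.
    """
    verdict, confidence = predictor(alpha, question)
    if confidence >= confidence_threshold:
        return verdict  # trust the predicted low-dimensional signal
    return run_quantilized_debate(alpha, question)  # run the real debate
```

This can be dropped into `autocalibrate` above (e.g. via `functools.partial`) so that most debates are predicted rather than run.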
There are various further nuances that should be considered:
Various ways to train the debaters can be considered, e.g. starting by generating good models of human judges[4] by passive observation and fine-tuning via actual debate.
The quantilized game is not exactly zero-sum so we need to rule out superrational cooperation, either by designing the debaters appropriately or via a unitary AI that is designed to search for Nash equilibria.
Instead of a discrete sequence, we can imagine continuous fine-tuning of the parameter, i.e. finding the most aggressive quantilization parameter which is self-endorsing (see the sketch below).
Various ways to operationalize and structure the debates might be considered, e.g. the 4-player system.
Finally, we also need to deal with non-Cartesian daemons. For this, let’s consider a “Turing” setting, i.e. a machine learning “core” connected to an external computer “envelope”. We assume that the core is not powerful enough to host a non-Cartesian daemon all by itself. We then protect ourselves by applying quantilization to the envelope actions as well. We might also be able to design a more detailed prior about dangerous programs than about Lovecraftian utterances. This requires autocalibrating at least two parameters, which seems doable: for example, increase both parameters together until you hit the Pareto frontier, then optimize along the frontier. A weakness in this approach is that, for very aggressive quantilization, the prediction system will become too weak to be useful, but perhaps this can be compensated for by the more detailed prior.
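Returning to the continuous fine-tuning variant from the list above: a minimal bisection sketch, under my own assumptions that the parameter is a scalar, that larger values mean stronger optimization (weaker quantilization), and that self-endorsement is monotone in the parameter:

```python
def find_self_endorsing_parameter(endorses, lo, hi, tol=1e-3):
    """Bisect for the strongest-optimization parameter that still endorses itself.

    endorses(alpha) -> bool: whether an alpha-quantilized debate endorses staying
        at alpha rather than moving to a slightly stronger parameter.
    lo, hi: bracketing values; lo is very aggressive quantilization (endorsed),
        hi is optimization strong enough that it is no longer endorsed.
    """
    assert endorses(lo) and not endorses(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if endorses(mid):
            lo = mid   # still self-endorsing: push optimization further
        else:
            hi = mid   # no longer endorsed: back off
    return lo
```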
Footnotes:
1. Without anything like factored cognition, which I am skeptical about.
2. To deal with Cartesian daemons we (i) only make predictions about the counterfactual in which the prediction wasn’t made, (ii) only produce predictions when they pass a confidence threshold, and (iii) use a physicalist prior (to be defined in a future post). [EDIT: Actually, this doesn’t work.] Non-Cartesian daemons are addressed in the main text above.
3. Including instructions for designing a different aligned AI.
4. This entire approach is the polar opposite of MIRI’s “avoid human models” strategy; nevertheless it seems viable.
I’m not sure this attacks goodharting directly enough. Optimizing a system for proxy utility moves its state out-of-distribution, where proxy utility generalizes training utility incorrectly. This probably holds for debate optimized towards intended objectives as much as for more concrete framings with state and utility.
Dithering across the border of goodharting (the border of the scope of a proxy utility) with quantilization is actionable, but isn’t about defining the border or formulating legible strategies for what to do about optimization when approaching the border. For example, one might try for shutdown, interrupt-for-oversight, or getting-back-inside-the-borders when optimization pushes the system outside, which is not quantilization. (Getting-back-inside-the-borders might even have weird-x-risk prevention as a convergent drive, but will oppose corrigibility. Some version of oversight/amplification might facilitate corrigibility.)
Debate seems more useful for amplification, extrapolating concepts in a way humans would, in order to become acceptable proxies in wider scopes, so that more and more debates become non-lovecraftian. This is a different concern from setting up optimization that works with some fixed proxy concepts as given.
I don’t understand what you’re saying here.
For debate, goodharting means producing an answer which can be defended successfully in front of the judge, even in the face of an opponent pointing out all the flaws, but which is nevertheless bad. My assumption here is that it’s harder to produce such an answer than to produce a genuinely good (and defensible) answer. If this assumption holds, then there is a range of quantilization parameters which yields good answers.
For the question of “what is a good plan to solve AI risk”, the assumption seems solid enough, since we’re not worried about coming across such deceptive plans on our own, and it’s hard to imagine humans producing one even on purpose. To the extent our search for plans relies mostly on our ability to evaluate arguments and find counterarguments, the difference between the former and the latter is not great anyway. This argument is especially strong if we use human debaters as the base distribution, although in this case we are vulnerable to the same competitiveness problem as amplified imitation, namely that reliably predicting rich outputs might be infeasible.
For the question of “should we continue changing the quantilization parameter”, the assumption still holds because the debater arguing to stop at the given point can win by presenting a plan to solve AI risk which is superior to continuing to change the parameter.
Goodharting is about what happens in situations where “good” is undefined or uncertain or contentious, but still gets used for optimization. There are situations where it’s better-defined, and situations where it’s ill-defined, and an anti-goodharting agent strives to optimize only within the scope of where it’s better-defined. I took “lovecraftian” as a proxy for situations where it’s ill-defined, and the base distribution of quantilization that’s intended to oppose goodharting acts as a quantitative description of where it’s taken as better-defined, so for this purpose the base distribution captures non-lovecraftian situations. Of the options you listed for debate, the distribution from imitation learning seems OK for this purpose, if amended by some anti-weirdness filters to exclude debates that can’t be reliably judged.
The main issues with anti-goodharting that I see are the difficulty of defining proxy utility and base distribution, the difficulty of making it corrigible, avoiding lock-in to a fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.
My point is that if anti-goodharting and not development of quantilization is taken as a goal, then calibration of quantilization is not the kind of thing that helps, it doesn’t address the main issues. Like, even for quantilization, fiddling with base distribution and proxy utility is a more natural framing that’s strictly more general than fiddling with the quantilization parameter. If we are to pick a single number to improve, why privilege the quantilization parameter instead of some other parameter that influences base distribution and proxy utility?
The use of debates for amplification in this framing serves the corrigibility part of anti-goodharting: a way to redefine the utility proxy and expand the base distribution, learning from how the debates at the boundary of the previous base distribution go. Quantilization seems like a fine building block for this, sampling slightly lovecraftian debates that are good, which is the direction in which we want to expand the scope.
The main issues with anti-goodharting that I see are the difficulty of defining proxy utility and base distribution, the difficulty of making it corrigible, avoiding lock-in to a fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.
The proxy utility in debate is perfectly well-defined: it is the ruling of the human judge. For the base distribution I also made some concrete proposals (which certainly might be improvable but are not obviously bad). As to corrigibility, I think it’s an ill-posed concept. I’m not sure how you imagine corrigibility in this case: AQD is a series of discrete “transactions” (debates), and nothing prevents you from modifying the AI between one debate and the next. Even inside a debate, there is no incentive in the outer loop to resist modifications, whereas daemons would be impeded by quantilization. The “out of scope” case is also dodged by quantilization, if I understand what you mean by “out of scope”.
...fiddling with base distribution and proxy utility is a more natural framing that’s strictly more general than fiddling with the quantilization parameter.
Why is it strictly more general? I don’t see it. It seems false, since for an extreme value of the quantilization parameter we get optimization which is deterministic and hence cannot be equivalent to quantilization with a different proxy and distribution.
If we are to pick a single number to improve, why privilege the quantilization parameter instead of some other parameter that influences base distribution and proxy utility?
The reason to pick the quantilization parameter is that it’s hard to determine, as opposed to the proxy and base distribution[1], for which there are concrete proposals with more-or-less clear motivation.
I don’t understand which “main issues” you think this doesn’t address. Can you describe a concrete attack vector?
If the base distribution is a bounded simplicity prior then it will have some parameters, and this is truly a weakness of the protocol. Still, I suspect that safety is less sensitive to these parameters and it is more tractable to determine them by connecting our ultimate theories of AI with brain science (i.e. looking for parameters which would mimic the computational bounds of human cognition).