Goodharting is about what happens in situations where “good” is undefined, uncertain, or contentious, but still gets used for optimization. There are situations where it’s better-defined and situations where it’s ill-defined, and an anti-goodharting agent strives to optimize only within the scope where it’s better-defined. I took “lovecraftian” as a proxy for situations where it’s ill-defined. The base distribution of a quantilizer that’s intended to oppose goodharting acts as a quantitative description of where “good” is taken as better-defined, so for this purpose the base distribution captures non-lovecraftian situations. Of the options you listed for debate, the distribution from imitation learning seems OK for this purpose, if amended by some anti-weirdness filters to exclude debates that can’t be reliably judged.
The main issues with anti-goodharting that I see are the difficulty of defining the proxy utility and base distribution, the difficulty of making it corrigible, not locking in a fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.
My point is that if anti-goodharting, and not the development of quantilization, is taken as the goal, then calibration of quantilization is not the kind of thing that helps; it doesn’t address the main issues. Like, even for quantilization, fiddling with base distribution and proxy utility is a more natural framing that’s strictly more general than fiddling with the quantilization parameter. If we are to pick a single number to improve, why privilege the quantilization parameter instead of some other parameter that influences base distribution and proxy utility?
The use of debates for amplification in this framing is for the corrigibility part of anti-goodharting: a way to redefine the proxy utility and expand the base distribution by learning from how the debates at the boundary of the previous base distribution go. Quantilization seems like a fine building block for this: it samples slightly lovecraftian debates that are good, which is the direction in which we want to expand the scope.
The main issues with anti-goodharting that I see are the difficulty of defining the proxy utility and base distribution, the difficulty of making it corrigible, not locking in a fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.
The proxy utility in debate is perfectly well-defined: it is the ruling of the human judge. For the base distribution I also made some concrete proposals (which certainly might be improvable but are not obviously bad). As to corrigibility, I think it’s an ill-posed concept. I’m not sure how you imagine corrigibility in this case: AQD is a series of discrete “transactions” (debates), and nothing prevents you from modifying the AI between one debate and the next. Even inside a debate, there is no incentive in the outer loop to resist modifications, whereas daemons would be impeded by quantilization. The “out of scope” case is also dodged by quantilization, if I understand what you mean by “out of scope”.
...fiddling with base distribution and proxy utility is a more natural framing that’s strictly more general than fiddling with the quantilization parameter.
Why is it strictly more general? I don’t see it. It seems false: for an extreme value of the quantilization parameter we get optimization, which is deterministic and hence cannot be equivalent to quantilization with a different proxy and distribution.
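To make the extreme-parameter case concrete, here is a minimal sketch of a quantilizer over a finite set of options. It assumes the usual definition (sample from the top q fraction of base-distribution mass, ranked by proxy utility). The names `quantilize`, `base_weights`, and `proxy_utility` are illustrative and not part of any debate protocol discussed here, and the toy "debates" are just integers scored by their own value.

```python
import random

def quantilize(actions, base_weights, proxy_utility, q, rng=random):
    """Sample from the top-q fraction of base mass, ranked by proxy utility.

    Illustrative sketch: assumes a finite action set and positive base weights.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    budget = q * sum(base_weights)  # base mass the quantilizer may keep
    ranked = sorted(zip(actions, base_weights),
                    key=lambda pair: proxy_utility(pair[0]), reverse=True)
    kept, kept_weights, mass = [], [], 0.0
    for action, weight in ranked:
        remaining = budget - mass
        if remaining <= 1e-12:  # budget exhausted (with float tolerance)
            break
        take = min(weight, remaining)  # the last kept action may be partial
        kept.append(action)
        kept_weights.append(take)
        mass += take
    # Within the kept set, sampling stays proportional to the base distribution.
    return rng.choices(kept, weights=kept_weights, k=1)[0]

# Hypothetical toy setup: ten "debates", uniform base distribution,
# proxy utility equal to the debate's index.
debates = list(range(10))
uniform = [1.0] * len(debates)
score = lambda d: d
rng = random.Random(0)

mild = {quantilize(debates, uniform, score, 0.3, rng) for _ in range(200)}
extreme = {quantilize(debates, uniform, score, 0.05, rng) for _ in range(200)}
```

In this toy setup `mild` only ever contains the three highest-scoring debates, while `extreme` collapses to the single argmax, which is the deterministic limit referred to above. With q = 1 the sampler simply reproduces the base distribution.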
If we are to pick a single number to improve, why privilege the quantilization parameter instead of some other parameter that influences base distribution and proxy utility?
The reason to pick the quantilization parameter is that it’s hard to determine, as opposed to the proxy and base distribution[1], for which there are concrete proposals with more-or-less clear motivation.
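As background on why this one number matters, the standard guarantee for quantilizers can be stated as follows. The notation is an assumption for illustration: γ is the base distribution, Q_q is the q-quantilizer built from it, and c is any non-negative cost function, including one the proxy utility fails to capture.

```latex
% Pointwise, the quantilizer never upweights an action by more than 1/q:
Q_q(a) \le \frac{\gamma(a)}{q} \quad \text{for all } a,
% so for any cost function c \ge 0:
\mathbb{E}_{a \sim Q_q}[c(a)]
  = \sum_a Q_q(a)\, c(a)
  \le \frac{1}{q} \sum_a \gamma(a)\, c(a)
  = \frac{1}{q}\, \mathbb{E}_{a \sim \gamma}[c(a)].
```

Under this reading, q directly sets how much worse than the base distribution the unmodeled cost can get, so choosing q too small loosens the bound and choosing it too large gives up the benefit of optimization.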
I don’t understand which “main issues” you think this doesn’t address. Can you describe a concrete attack vector?
[1] If the base distribution is a bounded simplicity prior then it will have some parameters, and this is truly a weakness of the protocol. Still, I suspect that safety is less sensitive to these parameters and it is more tractable to determine them by connecting our ultimate theories of AI with brain science (i.e. looking for parameters which would mimic the computational bounds of human cognition).