Here are some thoughts on how to fix the issues with the proposal.
The initial proposal distribution should be sampling things close to the initial QAS, not sampling totally random neural networks / etc.
This addresses the “quantilizers can’t do the work of finding powerful QASs” problem. Instead, quantilizers are only doing the work of improving QASs. We don’t need to assume smart, aligned QASs are significantly more common than malign QASs; instead, we only need to assume that catastrophic modifications are rare in the local neighborhood of something aligned.
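As a rough sketch of what this could look like (the parameter-space Gaussian perturbation, and all the names here, are illustrative assumptions rather than part of the proposal; the discussion below suggests weight-space closeness may not even be the right notion):

```python
import numpy as np

def local_proposal(current_params, n_samples, sigma=0.01, rng=None):
    """Sample candidate QASs as small perturbations of the current QAS's
    parameters, rather than drawing totally random networks."""
    rng = np.random.default_rng() if rng is None else rng
    return [current_params + sigma * rng.standard_normal(current_params.shape)
            for _ in range(n_samples)]

def quantilize(candidates, value_fn, q=0.1, rng=None):
    """A q-quantilizer over the proposal: rank candidates by the current
    value function and pick uniformly at random from the top q fraction."""
    rng = np.random.default_rng() if rng is None else rng
    ranked = sorted(candidates, key=value_fn, reverse=True)
    top_k = max(1, int(q * len(ranked)))
    return ranked[rng.integers(top_k)]
```

The only change from the original proposal is where the candidates come from: `local_proposal` is centred on the current QAS instead of on the space of all networks.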
This approach has some problems.
1a. We need to start out with a QAS in the same format as the one we are trying to train (eg a big neural net).
This is because (I presume) it’s pretty difficult to sample nearby QASs otherwise.
This is potentially annoying because it’s easier to build the initial value function and proposal distribution out of regular code.
1b. We need a distance metric which we can easily sample from, while also being quite confident that “close QASs are safe QASs” (if chosen at random).
We need to be pretty confident that close QASs are substantially similar; eg, there’s no “flip the value function” move which would count as “nearby” in our metric. This suggests a metric on outputs (ie, answers to questions) rather than directly on representation (eg, neural network weights). This is obviously more difficult to sample from.
Also, a single modified answer can easily be catastrophic down the line; eg, a “nearby” QAS might not directly flip the value function, but it might modify the value function to prefer a flip, while keeping other aspects identical. So we can come up with yet-more-sophisticated distance functions which check for closeness in higher-order properties (ie, close-0 might be similarity in answers, while close-1 is close-0 similarity in children’s answers, close-2 is the close-0 similarity of grandchildren, etc). Each increment makes sampling significantly more difficult, but also makes it harder for random sampling to introduce a catastrophe.
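To make the close-0 / close-1 idea concrete, here is a minimal sketch, assuming (purely for illustration) that a QAS can be called as a function from questions to answers, that we have some base answer_similarity measure, and that a successor hook can extract the “child” QAS each system would endorse:

```python
def close_0(qas_a, qas_b, questions, answer_similarity):
    """Close-0: average similarity of the two systems' answers
    on a reference set of questions."""
    sims = [answer_similarity(qas_a(q), qas_b(q)) for q in questions]
    return sum(sims) / len(sims)

def close_1(qas_a, qas_b, questions, answer_similarity, successor):
    """Close-1: the close-0 similarity of the children each system
    would produce (successor is an assumed hook, not a real API)."""
    return close_0(successor(qas_a), successor(qas_b),
                   questions, answer_similarity)
```

close-2 would apply close_0 to grandchildren, and so on; each level requires actually running (or faithfully simulating) another generation of the process, which is why sampling under these metrics gets rapidly more expensive.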
There may be other distance metrics which capture what’s important even better.
The question is: at what point are we satisfied that catastrophe is improbable enough? IE, what distance metric allows us to turn up our quantilizer’s optimization sufficiently?
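One handle on “improbable enough”: the standard quantilizer guarantee is that sampling from the top q fraction of a base distribution amplifies the probability of any event by at most a factor of 1/q. So if a random sample from the chosen neighborhood is catastrophic with probability at most δ, the quantilized step is catastrophic with probability at most δ/q. The distance metric’s job is to make δ small enough that we can afford a small q (strong optimization) while keeping δ/q acceptable.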
The next stage of quantilization should always be an improvement in value; perhaps, always an improvement in several relevant dimensions of value.
This is intended to address the “iterated quantilization accumulates risk of drift” problem.
The requirement is close to “the quantilizers need to be sampling at most the top 50%”, but that’s not quite right, because the current QAS might not sit at the 50% mark of the sampling distribution. If it’s hard to sample improvements (so most samples are worse by our value function), the current QAS might be high in the ranking.
This gives us a stopping condition: the system should never continue if it can’t safely eliminate current and past QASs from the running. Each step should be better than what’s come before.
It’s also possible we should require each new QAS to be “better in every way”; ie, require simultaneous improvement along several important dimensions. For example, improvement in raw score on example questions, improvement on “safety of cognition” score (ie, evaluation of how good a thought process looks when it’s examined by the current QAS), improvement on “quality of proposal distribution”, improvement on “quality of value function”.
Each additional requirement makes it more difficult to proceed with quantilization, and hence, more probable that the process will stop early.
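A minimal sketch of this “strictly better than everything so far, on every tracked dimension” acceptance rule, where the dimension names are just the examples above and scores(qas) is an assumed hook returning the current QAS’s evaluation of a candidate:

```python
def acceptable(candidate, history, scores):
    """Accept a candidate only if it strictly beats every current and past
    QAS on every tracked dimension (raw score, safety of cognition,
    proposal-distribution quality, value-function quality, ...)."""
    cand = scores(candidate)
    for previous in history:
        prev = scores(previous)
        if not all(cand[dim] > prev[dim] for dim in prev):
            return False
    return True

def next_step(candidates, history, scores):
    """One iteration: keep only candidates that eliminate the current and
    past QASs from the running; halt (return None) if there are none."""
    survivors = [c for c in candidates if acceptable(c, history, scores)]
    if not survivors:
        return None  # stopping condition: the process ends early
    return survivors  # quantilize among these, e.g. as sketched earlier
```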
It’s particularly important that the next QAS has a lower chance of catastrophe—indeed, it should be significantly lower. For example, we might require it to cut chance of catastrophe in half. (I’m not working out how risk accumulates in detail; this might not be a good proposal.)
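For a rough sense of why halving could be enough: if the first step carries catastrophe probability at most p and each later step carries at most half the probability of the step before it, a union bound gives total catastrophe probability at most p + p/2 + p/4 + … = 2p, no matter how many steps are taken. (This treats the per-step bounds as given and independent of earlier evaluation errors, which is exactly what the next point worries about.)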
Chance of catastrophe is as evaluated by previous QASs. So it’s critical that the initial QAS has a good model of how to evaluate risk of catastrophe. This is, of course, difficult. Moreover, the quality of this evaluation is another critical feature to preserve and improve. One bad risk evaluation in the chain can itself become a catastrophe.