Recursive Quantilizers II
I originally introduced the recursive quantilizers idea here, but didn’t provide a formal model until my recent Learning Normativity post. That formal model had some problems. I’ll correct some of those problems here. My new model is closer to HCH+IDA, and so is even closer to Paul Christiano-style systems than my previous one.
However, I’m also beginning to suspect that quantilizers aren’t the right starting point. I’ll state several problems with quantilizers at the end of this post.
First, let’s reiterate the design criteria, and why the model in Learning Normativity wasn’t great.
Criteria
Here are the criteria from Learning Normativity, with slight revisions. See the earlier post for further justifications/intuitions behind these criteria.
No Perfect Feedback: we want to be able to learn with the possibility that any one piece of data is corrupt.
Uncertain Feedback: data can be given in an uncertain form, allowing 100% certain feedback to be given (if there ever is such a thing), but also allowing the system to learn significant things in the absence of any certainty.
Reinterpretable Feedback: ideally, we want rich hypotheses about the meaning of feedback, which help the system to identify corrupt feedback, and interpret the information in imperfect feedback. To this criterion, I add two clarifying criteria:
Robust Listening: in some sense, we don’t want the system to be able to “entirely ignore” humans. If the system goes off-course, we want to be able to correct that.
Arbitrary Reinterpretation: at the same time, we want the AI to be able to entirely reinterpret feedback based on a rich model of what humans mean. This criterion stands in tension with Robust Listening. However, the proposal in the present post is, I think, a plausible way to achieve both.
No Perfect Loss Function: we don’t expect to perfectly define the utility function, or what it means to correctly learn the utility function, or what it means to learn to learn, and so on. At no level do we expect to be able to provide a single function we’re happy to optimize. This is largely due to a combination of Goodhart and corrupt-feedback concerns.
Learning at All Levels: Although we don’t have perfect information at any level, we do get meaningful benefit with each level we step back and say “we’re learning this level rather than keeping it fixed”, because we can provide meaningful approximate loss functions at each level, and meaningful feedback for learning at each level. Therefore, we want to be able to do learning at each level.
Between-Level Sharing: Because this implies an infinite hierarchy of levels to learn, we need to share a great deal of information between levels in order to learn meaningfully. For example, Occam’s razor is an important heuristic at each level, and information about what malign inner optimizers look like is the same at each level.
Process Level Feedback: we want to be able to give feedback about how to arrive at answers, not just the answers themselves.
Whole-Process Feedback: we don’t want some segregated meta-level which accepts/implements our process feedback about the rest of the system, but which is immune to process feedback itself. Any part of the system which is capable of adapting its behavior, we want to be able to give process-level feedback about.
Learned Generalization of Process Feedback: we don’t just want to promote or demote specific hypotheses. We want the system to learn from our feedback, making generalizations about which kinds of hypotheses are good or bad.
Failed Criteria
The previous recursive-quantilization model failed some criteria:
No reinterpretable feedback. I didn’t provide any method for achieving that.
No whole-process feedback. The way I set things up, the initial distributions are judged only on their later consequences. This leaves them wide open to inner optimizers and other problematic reasoning steps.
We can fix this by allowing the user to give direct feedback on the initial distributions as well, but then there’s no mechanism for Learned Generalization of that particular feedback. So we’re caught in the middle, unable to satisfy both those criteria at once.
The current proposal solves both problems, and due to an analogy to iterated amplification, may also be more computationally feasible.
The New Proposal
Like iterated amplification, the new proposal consists of both an idealized definition of aligned behavior (HCH, in the context of iterated amplification) and a candidate approximation of this ideal (like iterated amplification itself, which is supposed to approximate HCH).
The Ideal
The object which quantilization selects on will be referred to as a “question-answering system”, or QAS for short. This is what I called a “UTAA” in the previous post. As before, this is one object which has opinions about the safe distribution for quantilization (you can ask it “what’s a safe distribution over QASs to quantilize on?”), about the value function (“give me a value function to judge the quality of a QAS”), and about the object-level solution to whatever problem you’re trying to get this whole setup to solve (you ask it your object-level questions).
So the goal is to get a really good QAS, where “really good” means highly capable and highly aligned. Since we’re specifying the ideal here, we get to use unbounded processing power in doing so.
Humans provide an initial “safe” distribution on QASs, and an initial loss function which will be used to judge QASs for their quality. We then quantilize. This yields a new distribution over QASs, which we use to define another stage of quantilization (this time with safe distribution and value function coming from the QASs themselves), and so on, until we reach an equilibrium where an additional step does not add anything.
As before, we can imagine this as an infinite or very large tree of quantilization:
(Note that, previously, I treated Q as a three-argument stochastic function, requiring a safe distribution, a value function (AKA loss function), and a parameter telling it how much to optimize. Here, I am suppressing the third parameter. Perhaps we keep it fixed, or perhaps we ask a third QAS for it, as before—I have no improved suggestions for how we estimate that parameter at the moment.)
The tree structure means that a QAS is never evaluating itself; it is being evaluated by another QAS sampled from an independent multi-level quantilization, which can spot mistakes in its own way.
Now, loss functions in general (including both the initial human-given loss function and the improved loss functions proposed by the QASs) need access to the entire experience of the system. The initial human-given loss function may be something like “pick out the explicit training feedback humans give, and translate it into a numerical score for a QAS, ignoring natural-language comments made by the humans, evidence from the context, etc.” The loss functions proposed by the QASs are supposed to get cleverer, interpreting natural-language feedback given by humans, using contextual clues, etc.
You might ask, “shouldn’t we build simple natural-language understanding into the initial loss function?”—we could, but that’s supposed to be redundant. There should be no reason to train a separate natural-language understanding system, rather than use explicit feedback to train the QAS to understand natural-language feedback.
Uncertain Feedback
The initial loss function should have the ability to learn from uncertain feedback, to satisfy the Uncertain Feedback criterion. (We can implement one of the options mentioned in Learning Normativity.)
Balancing Robust Listening and Arbitrary Reinterpretation
The idealized objective presented in this section offers a nice combination of “robust listening” (not ever reaching a state where it totally fails to respond to human feedback) and “arbitrary reinterpretation” (being able to use rich models to totally reinterpret the meaning of human feedback). These two objectives would seem to be at odds, since a system with an arbitrary model of the meaning of human feedback can ignore anything.
In the proposal here, direct human feedback always shapes the initial dynamic. If you see a system do something stupid and add direct feedback against this to the data, then QASs which engage in this behavior will be ruled out in the first step of quantilization. You can also add strong evidence against QASs which fail to judge other QASs harshly for violating such principles (and which fail to punish non-punishers, recursively). So, human feedback always has total control to shape the ultimate result.
At the same time, however, only the initial dynamic is required to obey this feedback in a naive way. At later stages, QASs are supposed to understand human feedback better, identifying feedback which should be interpreted less literally.
This means the equilibrium result is both fully dependent on a 100% literal interpretation of human feedback, and at the same time, fully the result of a more careful, nuanced interpretation of what humans meant.
Process-Level Feedback
It’s also important to note that the initial loss function needs to understand process-level feedback. You can imagine that the dataset includes the following (a sketch of how such a dataset might be represented appears after the list):
Extensive human-generated Q&A behavior, much like what would be needed for Iterated Amplification. This is the bulk of what loads general world-knowledge into the QASs that come out of the first round of quantilization.
Extensive labeling of QASs as “ok” or “bad”, to demonstrate how to examine QAS source code (/ neural network / etc) and evaluate when something looks wrong. This could include many toy examples which illustrate bad behavior, as well as real examples gathered during training. The QAS should be trained to imitate this behavior when evaluating other QASs.
Similarly, examples of humans examining QAS execution and labeling steps as bad, rather than just examining the source code directly.
You can imagine humans engaging in an iterative process where they examine the behavior of many QAS created by early attempts at this process, which creates more training data for later iterations. However, since we’re still only describing the ideal, that’s not realistic; humans never have access to the true output of the ideal process.
Let’s move on to describing the iterated-amplification analogue.
The Implementation Proposal
The basic idea is to take iterated amplification and replace the HCH-like “amplification” operation with quantilization-based amplification.
At every stage of the process, we’re not learning a single QAS, but rather, learning a distribution over QAS.
Define the distribution A1 as the result of quantilizing the safe distribution given by humans, using the value function given by humans.
The distribution D1 is the distillation of A1; ie, we learn a more computationally efficient approximation of the distribution A1 using some distribution-modeling technique such as GANs.
The distribution A2 is defined by sampling two QAS from D1, asking one for the value function and the other for the safe distribution, and quantilizing.
The distribution D2 is obtained by learning an efficient approximation of A2.
And so on. An is defined as the output of a quantilizer which takes its value function and safe distribution from (independent) samples from Dn−1. Dn is a learned approximation of An.
If the distillation step were perfect, this would converge to the desired equilibrium.
Note, however, that this assumes that all human feedback is available from the beginning. Realistically, we want more interactive training, so that humans can observe the intermediate outputs of the process and provide new training data to account for problems. The vast majority of training data involving QAS evaluation may come from such interaction.
I visualize this as “iterated amplification takes us upward, but not rightward” in the following coordinates:
If we start amplifying something with very little human-input knowledge, we’re likely to get total nonsense out; the QASs just don’t know enough to optimize any coherent thing over successive amplification steps. If we start with a moderate amount of knowledge, the QASs might sputter into nonsense, or might get on the right track, but also have a high risk of producing highly-capable but malign agents. At some level of sufficient human input, we fall into the “basin of corrigibility” and get aligned outputs no matter what (perhaps even the same exact benign output).
Yet, it’s hard to provide everything up-front. So, more realistically, the picture might look like this:
The golden line is the standard set by the idealized model of recursive quantilizers, where we have all the training data before starting the amplification process. The green line is a more realistic line, where we make amplification moves and increase training data in a mixed way, but avoid over-amplifying an under-trained model.
It’s really important to stay close to the x-axis early in the process, because the system itself is determining how to evaluate the loss function—so it’s evaluating the very meaning of the training data (in line with the Reinterpretable Feedback criterion). It is therefore very important that we don’t let the system drift too far in the direction of an extreme reinterpretation of the training data (in line with the Robust Listening criterion). At the very start of the training process, I imagine you’d often restart the training process from scratch with all the new feedback in place, rather than trust the system to understand new data.
In other words, we don’t expect x-dimension moves to work if we’re too high in the y-dimension:
Unfortunately, it’s difficult to know what the region is where x-dimension moves work, so it’s difficult to know when amplification would keep us within that region vs take us out of it.
Another way to put it: this implementation puts the “robust listening” criterion at serious risk. The partially amplified agent can easily stop listening to human feedback on important matters, about which it is sure we must be mistaken.
Really, we would want to find an engineering solution around this problem, rather than haphazardly steering through the space like I’ve described. For example, there might be a way to train the system to seek the equilibrium it would have reached if it had started with all its current knowledge.
Comparison to Iterated Amplification
Because this proposal is so similar to iterated amplification, it bears belaboring the differences, particularly the philosophical differences underlying the choices I’ve made.
I don’t want this to be about critiquing iterated amplification—I have some critiques, but the approach here is not mutually exclusive with iterated amplification by any means. Instead, I just want to make clear the philosophical differences.
Both approaches emphasize deferring the big questions, setting up a system which does all the philosophical deliberation for us, rather than EG providing a correct decision theory.
Iterated Amplification puts humans in a central spot. The amplification operation is giving a human access to an (approximate) HCH—so at every stage, a human is making the ultimate decisions about how to use the capabilities of the system to answer questions. This plausibly has alignment and corrigibility advantages, but may put a ceiling on capabilities (since we have to rely on the human ability to decompose problems well, creating good plans for solving problems).
Recursive quantilization instead seeks to allow arbitrary improvements to the deliberation process. It’s all supervised by humans, and initially seeded by imitation of human question-answering; but humans can point out problems with the human deliberation process, and the system seeks to improve its reasoning using the human-seeded ideas about how to do so. To the extent that humans think HCH is the correct idealized reasoning process, recursive quantilization should approximate HCH. (To the extent it can’t do this, recursive quantilization fails at its goals.)
One response I’ve gotten to recursive quantilization is “couldn’t this just be advice to the human in HCH?” I don’t think that’s quite true.
HCH must walk a fine line between capability and safety. A big HCH tree can perform well at a vast array of tasks (if the human has a good strategy), but in order to be safe, the human must operate under a set of restrictions, such as “don’t simulate unrestricted search in large hypothesis spaces”—with the full set of restrictions required for safety yet to be articulated. HCH needs a set of restrictions which provide safety without compromising capability.
In Inaccessible Information, Paul draws a distinction between accessible and inaccessible information. Roughly, information is accessible if we have a pretty good shot at getting modern ML to tell us about it, and inaccessible otherwise. Inaccessible information can include intuitive but difficult-to-test variables like “what Alice is thinking”, as well as superhuman concepts that a powerful AI might invent.
A powerful modern AI like GPT-3 might have and use inaccessible information, such as “what the writer of this sentence was really thinking”, but we can’t get GPT-3 to tell us about it, because we lack a way to train it to do so.
One of the safety concerns of inaccessible information Paul lists is that powerful AIs might be more capable than aligned AIs due to their ability to utilize inaccessible information, where aligned AIs cannot. For example, GPT-5 might use inhuman concepts, derived from its vast experience predicting text, to achieve high performance. A safe HCH would never be able to use those concepts, since every computation within the HCH tree is supposed to be human-comprehensible. (Therefore, if the result of Iterated Amplification was able to use such concepts, we should be worried that it did not successfully approximate HCH.)
Paul proposes learning the human prior as a potential solution. As I understand it, the basic idea is that HCH lacks Deep Learning’s ability to absorb vast quantities of data and reach new conclusions. By learning the human prior, Paul seeks to learn the human response to those vast quantities of data. This would allow an HCH-like approach to learn the same “alien concepts” that a misaligned AI might learn.
I don’t really understand how Paul sees HCH and learned priors as fitting together. Recursive quantilization seeks to get around this difficulty by training the QASs on lots of data, in a way similar to big-data ML. As I emphasized before, recursive quantilization seeks to allow arbitrary improvements to the reasoning process, so long as they are improvements.
Of course, recursive quantilization also needs to restrict the QAS reasoning processes via safety constraints (using process-level feedback to rule out types of reasoning which can’t be verified as safe). This could lead to the same problems with inaccessible information. But if it does, that’s because no human-approved reasoning can approve of reasoning which utilizes the inaccessible information—not even after many rounds of amplifying the human-seeded knowledge about how to judge reasoning as safe/unsafe. So, I would argue that if an approach broadly like recursive quantilization can’t do it, then no robustly aligned method can do it.
Reasons why quantilizers might not be appropriate:
As I mentioned at the beginning, I’ve been feeling like quantilizers might be the wrong starting point for the Learning Normativity agenda. There’s still some hope, but there are a few problems which make me want to look for other directions.
Quantilizers bound risk, but iteration increases risk arbitrarily.
The whole argument behind quantilization is as follows:
The safe distribution has below c probability of catastrophe.
The limited optimization samples from the top q of the safe distribution.
So, the risk of catastrophe from the limited optimization is at most c/q.
In other words, the bound on catastrophic risk grows exactly in proportion to how much we optimize.
Recursive quantilization does this over and over and over again. Naively, as we seek the fixed point, it seems like our bound on catastrophic risk gets worse and worse until we have no bound at all.
There could be a way around this. We could get a better risk bound, which took into account the improved optimization from each step in the recursive quantilization. The story could be something like this:
In each quantilization step, we have an overwhelming chance of increasing the quality of value functions and safe distributions proposed in the next round.
This means (with overwhelming odds) we get better at detecting catastrophe, so we can actively optimize against problems which the previous round of quantilization had to merely hedge against by avoiding over-optimizing.
Therefore, the chance of catastrophe from each round of quantilization can be dramatically lower than the previous. Although these chances do add up over successive rounds of quantilization, they can sum to a negligible probability overall, even in the limit.
It would be nice to put together such a story, but it isn’t clear how to argue that the improvement in value function and proposal distribution cuts into the probability of catastrophe by a sufficient amount.
Quantilizers assume catastrophes are rare in random choices, but in the recursive setting, too much randomness can create value drift.
Quantilization assumes we can get a “safe distribution” which might produce mediocre actions, but at least has a very low probability of catastrophic risk.
Supposing the scheme could produce highly capable systems at all, the recursive-quantilization environment seems like one where randomized actions have a good chance of producing catastrophe, so safe distributions would be incredibly difficult to engineer.
Imagine we’re computing a big recursive quantilization tree. Randomly selecting a poor value function at one point will have a cascading effect. It will feed into a quantilization step which selects a bad QAS, which will go on to produce either the value function or the safe distribution for another quantilization, and so on. There’s not fundamentally any corrective effect—if either of the two inputs to quantilization is poor enough, then the output will be poor.
There’s a basin of attraction here: things go well if the proposed safe distribution always contains good proposals with non-negligible probability, and the value function always has enough of the right meta-principles to correct specific errors that may be introduced through random error. But it’s quite important that the output of each quantilization be better than the previous; if not, then we’re not in a basin of attraction.
All of this makes it sound quite difficult to propose a safe distribution. The safe distribution needs to already land you within the basin of attraction (with very high probability), because drifting out of that basin can easily create a catastrophe.
Here’s a slightly different argument. At each quantilization step, including the very first one, it’s important that we find a QAS which actually fits our data quite well, because it is important that we pin down various things firmly in order to remain in the basin of attraction (especially including pinning down a value function at least as good as our starting value function). However, for each QAS which fits the data quite well and points to our desired basin of attraction, there are many alternative QAS which don’t fit our data well, but point to very different, but equally coherent, basins of attraction. (In other words, there should be many equally internally consistent value systems which have basins of attraction of similar size.)
Since these other basins would be catastrophic, this means c, the probability of catastrophe, is higher than q, the amount of optimization we need to hit our narrow target. In that case the bound c/q is greater than 1, and guarantees nothing.
This means the safe distribution has to be doing a lot of work for us.
Like the previous problem I discussed, this isn’t necessarily a showstopper, but it does say that we’d need some further ideas to make recursive quantilization work, and suggests to me that quantilization might not be the right way to go.
Other Concerns
Quantilizers don’t have the best handles for modeling human philosophical deliberation over time. In other words, I don’t think recursive quantilization absorbs the lesson of radical probabilism. In particular, although recursive quantilization involves iteratively improving a picture of “good reasoning”, I think it lacks a kind of stability—the picture of good reasoning must be entirely reconstructed each time, from “first principles” (IE from the principles developed in the previous step). I currently see no guarantee that recursive quantilization avoids being Dutch-Book-able over these stages, or any other such dynamic optimality notion.
Quantilizers aren’t great for modeling a collection of partial models. Since a quantilizer spits out one (moderately) optimized result, I have to focus on single QASs, rather than collections of experts which cover different areas of expertise. This means we don’t get to break down the problem of reasoning about the world.
Quantilizers don’t put world models in a central spot. By putting optimization in a central spot, we sort of sideline reasoning and belief. This obscures the mechanism of updating on new information.