A Concrete Proposal for Adversarial IDA

Note: This post came out of a conversation with Geoffrey Irving and Buck Shlegeris.
Epistemic Status: I suspect Paul has already thought of most or all of the ideas presented here, though I nevertheless found the exercise of carefully specifying an IDA implementation helpful and suspect others may find reading it helpful as well.
This is a proposal for how to train a machine learning model to approximate HCH using Iterated Distillation and Amplification (IDA). This particular proposal came out of a desire to use a debate-like adversary to improve the amplification process, and its primary goal is to show how one could do that. Though I have tried to retain most of the relevant detail, I have made two simplifications to keep the proposal easy to specify: I am attempting to approximate something closer to weak HCH rather than strong HCH, and I am only allowing the generation of two subquestions at a time. I am confident that those simplifications could easily be dropped, though I think doing so here would only make this presentation more complicated.
Before I proceed, I want to make one final note: this is not a proposal for how to build an aligned AGI. I think there are still a whole bunch of issues that would prevent this proposal from actually working.
Definitions
We will start with some initial definitions:
Let Q be the set of all questions in natural language.
Let A be the set of all answers in natural language.
Let M be the sum type of either Q×Q or A, representing either two subquestions to help answer the given question or an answer to it.
Let H:Q→A be the answer that a human gives to the given question.
Let H_fan_out:Q→M be the answer or subquestion pair generated by a human when asked what to do with the given question.
Let H_fan_in:Q×(Q×A)×(Q×A)→M be the answer or two subquestions generated by a human to some question when given answers to two subquestions related to that question.
Let ML:Q→Δ(A) be a model (the training procedure for which we will describe below) from questions to a probability distribution over strings representing answers. Specifically, we will implement the probability distribution by having our model output an embedding vector which yields the probability distribution when fed into some trained language model (by repeatedly conditioning on previous characters and multiplying all the conditional probabilities).
Let ML_fan_out:Q→Δ(M) be the function from questions to an embedding vector representing a distribution over answers or subquestions, generated by asking ML what to do with the given question. In the two-subquestion case, we enforce that the distribution is symmetric with respect to interchange of the subquestions.
Let ML_fan_in:Q×(Q×A)×(Q×A)→Δ(M) be the function from a question and two subquestion-answer pairs to an embedding vector representing a distribution over answers or subquestions, generated by asking ML to integrate the given subquestion answers into an answer to the original question. We again enforce symmetry with respect to interchange of the subquestions.
Let Adv:Q×A→R be an adversary model which we will train (as described below) to predict how good the given answer is to the given question.
Let Adv_fan_out:Q×M→R be an adversary model for ML_fan_out generated by calling Adv using the same transformation as used to implement ML_fan_out from ML.
Let Adv_fan_in:Q×(Q×A)×(Q×A)×M→R be an adversary model for ML_fan_in generated by calling Adv using the same transformation as used to implement ML_fan_in from ML.
Let ϵ be some Gumbel random variable such that each use of ϵ below is its own independent sample from a Gumbel distribution. We use a Gumbel random variable so that we can use the Gumbel-max trick to effectively do Boltzmann exploration.
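To make the role of ϵ concrete, here is a minimal Python sketch (not part of the proposal itself) of how the Gumbel-max trick implements Boltzmann exploration over a set of scores; the temperature parameter is my own addition for illustration.

```python
import math
import random

def gumbel_sample() -> float:
    """Draw one sample from a standard Gumbel distribution."""
    u = random.random()
    return -math.log(-math.log(u))

def gumbel_max_choice(scores, temperature=1.0):
    """Pick an index with probability proportional to exp(score / temperature).

    Adding an independent Gumbel sample to each (scaled) score and taking the
    argmax samples from the Boltzmann (softmax) distribution over the scores,
    which is the role the independent ϵ samples play below.
    """
    noisy = [s / temperature + gumbel_sample() for s in scores]
    return max(range(len(scores)), key=lambda i: noisy[i])

# Example: deciding which of two subquestions to recurse on. Negating the
# adversary scores makes the lower-scored (worse-looking) subquestion more
# likely to be chosen, matching the S1 + ϵ ≤ S2 + ϵ comparison in the
# amplification procedure below.
s1, s2 = 0.3, 0.9
recurse_on_first = gumbel_max_choice([-s1, -s2]) == 0
```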
Amplification
The goal here is to use the adversary to determine which portions of the HCH tree we actually descend into, so as to maximize our sample efficiency with respect to H. Specifically, the amplification procedure is as follows:
1. Generate some question Q∈Q either from the training distribution or from the output of some adversary.
2. Call H_fan_out(Q). If we get an answer, let A be that answer and skip to the end. Otherwise, let SQ1, SQ2 be the resulting subquestions.
3. Call Adv to generate scores S1=Adv(SQ1,ML(SQ1)) and S2=Adv(SQ2,ML(SQ2)).
4. Without loss of generality, let S1+ϵ≤S2+ϵ and recurse on SQ1 (by setting Q=SQ1 in the subcall) to generate A1 (from the A of the subcall) and let A2=ML(SQ2).
5. Call H_fan_in(Q,(SQ1,A1),(SQ2,A2)). If we get an answer, let A be that answer. Otherwise, let SQ1, SQ2 be the new subquestions and return to step 3.
This procedure should eventually produce an answer, though we then need to compute the following losses for ML, ML_fan_out, ML_fan_in, Adv, Adv_fan_out, and Adv_fan_in:
Note that some of these losses are actually for the same model, since ML_fan_out and ML_fan_in are implemented internally by calling ML, and Adv_fan_out and Adv_fan_in are implemented internally by calling Adv.
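To illustrate how steps 2 through 5 fit together, here is a rough Python sketch of the amplification recursion. The callables h_fan_out, h_fan_in, ml, adv, and gumbel are hypothetical stand-ins for H_fan_out, H_fan_in, ML, Adv, and ϵ, and answers are represented as strings and subquestion pairs as tuples purely for illustration; this is a sketch of the control flow only, not of the training code.

```python
def amplify(question, h_fan_out, h_fan_in, ml, adv, gumbel):
    """One amplification episode: descend the HCH tree along the branch where
    the adversary thinks ML's answer is weakest, consult the human there, and
    trust ML's answer on the other branch."""
    result = h_fan_out(question)                     # step 2
    if isinstance(result, str):                      # H answered directly
        return result
    sq1, sq2 = result
    while True:
        s1 = adv(sq1, ml(sq1))                       # step 3: score ML's answers
        s2 = adv(sq2, ml(sq2))
        if s1 + gumbel() <= s2 + gumbel():           # step 4: recurse on the
            a1 = amplify(sq1, h_fan_out, h_fan_in, ml, adv, gumbel)  # weaker one
            a2 = ml(sq2)
        else:
            a1 = ml(sq1)
            a2 = amplify(sq2, h_fan_out, h_fan_in, ml, adv, gumbel)
        result = h_fan_in(question, (sq1, a1), (sq2, a2))            # step 5
        if isinstance(result, str):
            return result
        sq1, sq2 = result                            # new subquestions: step 3
```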
Distillation
For the distillation procedure, we want to improve ML by training it to approximate its own ML-consulting-ML tree. Specifically:
1. Generate some Q∈Q as in amplification.
2. Call ML_fan_out(Q). If we get an answer, let A be that answer and skip to the end. Otherwise, let SQ1, SQ2 be the resulting subquestions.
3. Let A1=ML(SQ1) and A2=ML(SQ2).
4. Call ML_fan_in(Q,(SQ1,A1),(SQ2,A2)). If we get an answer, let A be that answer. Otherwise, let SQ1, SQ2 be the new subquestions and return to step 3.
Then, compute the following losses for ML,Adv:
L = −log(ML(Q) | A)
L_Adv = |Adv(Q, A) + L|²
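As a concrete reading of these two losses, here is a small Python sketch; log_prob_of_answer is a hypothetical quantity standing for the log-probability that the distribution ML(Q) assigns to the answer A produced by the tree.

```python
import math

def ml_loss(log_prob_of_answer: float) -> float:
    """L = -log(ML(Q) | A): negative log-likelihood of the tree's final
    answer A under the distribution that ML produces for the question Q."""
    return -log_prob_of_answer

def adv_loss(adv_score: float, ml_loss_value: float) -> float:
    """L_Adv = |Adv(Q, A) + L|^2: trains the adversary toward -L, so that a
    higher adversary score corresponds to an answer ML models better."""
    return (adv_score + ml_loss_value) ** 2

# Hypothetical numbers: ML assigns probability 0.1 to the tree's answer and
# the adversary predicted a score of -2.0 for that (question, answer) pair.
l = ml_loss(math.log(0.1))   # ~2.30
l_adv = adv_loss(-2.0, l)    # ~0.09
```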
Iteration
Finally, we want to iterate the above procedure by replacing H, H_fan_out, H_fan_in in the amplification procedure with some H′, H′_fan_out, H′_fan_in. First, let α be some confidence threshold. Then, we will define the following primed Hs:
Let H′(Q) = if Adv(Q, ML(Q)) + ϵ < α + ϵ then H(Q) else ML(Q).
Let H′_fan_out and H′_fan_in be defined analogously, replacing H, ML, and Adv with their fan-out and fan-in versions respectively.
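As a minimal sketch (using the same hypothetical callables as above), the primed human can be implemented as a wrapper that only consults the real human when the adversary's noisy score for ML's answer falls below the noisy threshold:

```python
def make_h_prime(h, ml, adv, alpha, gumbel):
    """Build H' from H: consult the human only when the adversary is not
    sufficiently confident in ML's answer."""
    def h_prime(question):
        ml_answer = ml(question)
        if adv(question, ml_answer) + gumbel() < alpha + gumbel():
            return h(question)     # adversary unconvinced: fall back to H
        return ml_answer           # otherwise trust ML
    return h_prime
```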
This procedure allows us to continue amplifying the model while using the adversary to require only minimal human data that is selected so as to be maximally helpful.
Conclusion
This proposal differs in a couple of ways from previous proposals made by Paul. First, Paul has recently moved away from discrete amplification/distillation steps. This proposal, however, provides a way to recover discrete steps while still collapsing the recursion. In practice, you might still just want to stick with the amplification procedure described here without doing the distillation step, as it isn't strictly necessary.
Second, this proposal uses an adversary to guide the training process. This technique is similar to the concept of importance sampling. The main benefit of this approach is that it takes advantage of active learning by allowing the system to choose which questions and subquestions would be most useful for it to have answered by a human.
Another benefit of the adversary, however, is that it could make transparency much easier. One of the major benefits of IDA is that ML gets trained to approximate its own ML-consulting-ML tree. As a result, the reasoning that went into the final answer produced by ML can be recovered by unfolding its tree (at least in the limit of perfect training). However, unfolding the entire tree is very expensive, as it's linear in the size of the tree. With an adversary, you can choose which portions of the tree to unfold first by calling the adversary, enabling you to find errors much more quickly; for a perfect adversary, this reduces the problem of finding an error to O(log n) instead of O(n).
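As a rough sketch of that unfolding strategy, suppose the ML-consulting-ML tree has been recorded as nodes with question, answer, and children attributes (a hypothetical representation); then one can descend greedily toward whichever child the adversary scores worst:

```python
def find_weakest_node(node, adv):
    """Walk down a recorded ML-consulting-ML tree, always following the child
    whose (question, answer) pair the adversary scores lowest. With a perfect
    adversary and a balanced tree, this inspects O(log n) nodes instead of
    unfolding all n of them."""
    while node.children:
        node = min(node.children, key=lambda c: adv(c.question, c.answer))
    return node
```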
Thus, the hope is that the use of such an adversary could assist both in making IDA more competitive (by increasing sample efficiency and using active learning) and in making IDA safer (due to the increased ease of transparency).
It should be noted, however, that it is also possible that the use of such an adversary might make the safety situation for IDA worse. First, it introduces the possibility of a robustness-to-relative-scale failure if either ML or Adv gets significantly stronger than the other. One possible way to resolve such an issue might be to give Adv the ability to call ML and vice versa, allowing them to use each other to boost their own capabilities. Second, if ML and Adv are themselves optimizers with goals that don't perfectly match their loss functions, they could cooperate to make it arbitrarily unlikely that H is ever consulted on some specific question. Third, even if ML and Adv weren't cooperating, an RSA-2048-style failure could still prevent the identification of malicious cognition. Resolving the latter two types of failure is still an open question (EDIT: see "Risks from Learned Optimization in Advanced Machine Learning Systems" by Hubinger, van Merwijk, Mikulik, Skalse, and Garrabrant).