But the collection bogs down with a series of essays (about 1⁄3 of the book) on Paul Christiano’s research on Iterated Amplification. This technique seems to be that the AI has to tell you at regular intervals why it is doing what it is doing, to avoid the AI gaming the system. That sounds interesting! But then the essays seem to debate whether it’s a good idea or not. I kept shouting at the book JUST TRY IT OUT AND SEE IF IT WORKS! Did Paul Christiano DO this? If so, what were the results? We never find out.
First of all, that’s not how I would describe the core idea of IDA. Secondly, and much more importantly, we can’t try it out yet because our AIs aren’t smart enough. For example, we could ask GPT-3 to tell us why it is doing what it is doing… but as far as we can tell it doesn’t even know! And if it did know, maybe it could be trained to tell us… but maybe that only works because it’s too dumb to know how to convincingly lie to us, and if it were smarter, the training methods would stop working.
Our situation is analogous to someone in medieval Europe who has a dragon egg and is trying to figure out how to train and domesticate the dragon so it doesn’t hurt anyone. You can’t “just try out” ideas because your egg hasn’t hatched yet. The best you can do is (a) practice on non-dragons (things like chickens, lizards, horses...) and hope that the lessons generalize, (b) theorize about domestication in general, so that you have a firm foundation to stand on when “crunch time” comes and you are actually dealing with a live dragon and trying to figure out how to train it, and (c) theorize about dragon domestication in particular by imagining what dragons might be like, e.g. “We probably won’t be able to put it in a cage the way we do with chickens and lizards and horses, because it will be able to melt steel with its fiery breath...”
Currently people in the AI risk community are pursuing the analogues of (a), (b), and (c). Did I leave out an option (d)? I probably did; I’d be interested to hear it!
How would you describe the core ideas of IDA? I will incorporate your answer into the review and hence make it more accurate! I will also incorporate the reason why we can’t try it out yet.
IDA stands for iterated distillation and amplification. The idea is to start with a system M which is aligned (such as a literal human), amplify it into a smarter system Amp(M) (for example by letting it think longer or spin off copies of itself), distill the amplified system into a new system M+ which is smarter than M but dumber than Amp(M), and then repeat indefinitely to scale up the capabilities of the system while preserving alignment.
The important thing is to ensure that the amplification and distillation steps both preserve alignment. That way, we start with an aligned system and continue having aligned systems every step of the way even as they get arbitrarily more powerful. How does the amplification step preserve alignment? Well, it depends on the details of the proposal, but intuitively this shouldn’t be too hard—letting an aligned agent think longer shouldn’t make it cease being aligned. How does the distillation step preserve alignment? Well, it depends on the details of the proposal, but intuitively this should be possible—the distilled agent M+ is dumber than Amp(M) and Amp(M) is aligned, so hopefully Amp(M) can “oversee” the training/creation of M+ in a way that results in M+ being aligned also. Intuitively, M+ shouldn’t be able to fool or deceive Amp(M) because it’s not as smart as Amp(M).
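Here is a rough sketch of the shape of that loop in code. To be clear, this is just my paraphrase of the structure; amplify and distill are hypothetical placeholders for whatever the real amplification and distillation procedures would be, not anything from Paul’s actual proposal.

```python
# Structural sketch only -- my paraphrase of the IDA loop, not code from
# Christiano's proposal. The two helpers are hypothetical placeholders.

def amplify(model):
    # e.g. let the model think longer, spin off copies of itself on
    # subquestions, or use it as a heuristic inside a search
    # -> a smarter but slower system Amp(M)
    raise NotImplementedError

def distill(amplified_system):
    # e.g. train a new fast model to imitate the amplified system, with the
    # amplified system overseeing the training
    # -> smarter than the old M, dumber than Amp(M), hopefully still aligned
    raise NotImplementedError

def ida(initial_model, n_rounds):
    model = initial_model                # M: assumed aligned to start with
    for _ in range(n_rounds):
        amplified = amplify(model)       # capability up, alignment preserved
        model = distill(amplified)       # the next, stronger aligned model M+
    return model
```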
Also note that the AlphaZero algorithm is an example of IDA:
The amplification step is when the policy / value neural net is used to play out a number of steps in the game tree, resulting in a better guess at what the best move is than just using the output of the net directly.
The distillation step is when the policy / value net is trained to match the output of the game tree exploration process.
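To make that mapping concrete, here is a tiny runnable toy of my own devising, not AlphaZero itself: a tabular value function stands in for the policy/value net, one-step lookahead stands in for the tree search, and an exact copy of the lookahead values stands in for gradient training. The shape of the loop is the same.

```python
# Toy illustration of the amplify/distill loop in its simplest tabular form.
# (My own example, not from the thread or the AlphaZero paper.)
# - the value table plays the role of the fast model M
# - one-step lookahead plays the role of the search-based amplification Amp(M)
# - copying the lookahead values into a fresh table plays the role of
#   distillation (AlphaZero does this step by gradient-training a neural net)

GAMMA = 0.9
N_STATES = 5                      # a small chain: action 1 moves right, 0 stays
REWARD = {N_STATES - 1: 1.0}      # reward only for being in the last state

def step(state, action):
    """Deterministic toy dynamics."""
    nxt = min(state + action, N_STATES - 1)
    return nxt, REWARD.get(nxt, 0.0)

def amplify(value_table, state):
    """'Amp(M)': spend extra compute (one-step lookahead) to get a better
    value estimate than the raw table gives."""
    return max(r + GAMMA * value_table[nxt]
               for nxt, r in (step(state, a) for a in (0, 1)))

def distill(value_table):
    """'M+': a new fast model trained to match the amplified system's output.
    Here the 'training' is just an exact copy of the lookahead values."""
    return [amplify(value_table, s) for s in range(N_STATES)]

value_table = [0.0] * N_STATES            # the initial weak model M
for _ in range(20):                       # iterate amplification + distillation
    value_table = distill(value_table)

print([round(v, 3) for v in value_table])  # estimates improve every round
```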
I have incorporated your comments and also acknowledged you. Thanks!