How would you describe the core ideas of IDA? I will incorporate your answer into the review and hence make it more accurate! I will also incorporate the reason why we can’t try it out yet.
IDA stands for iterated distillation and amplification. The idea is to start with an aligned system M (such as a literal human), amplify it into a smarter system Amp(M) (for example by letting it think longer or spin off copies of itself), distill the amplified system into a new system M+ which is smarter than M but dumber than Amp(M), and then repeat indefinitely to scale up the capabilities of the system while preserving alignment.
The important thing is to ensure that the amplification and distillation steps both preserve alignment. That way, we start with an aligned system and continue having aligned systems every step of the way even as they get arbitrarily more powerful. How does the amplification step preserve alignment? Well, it depends on the details of the proposal, but intuitively this shouldn’t be too hard—letting an aligned agent think longer shouldn’t make it cease being aligned. How does the distillation step preserve alignment? Well, it depends on the details of the proposal, but intuitively this should be possible—the distilled agent M+ is dumber than Amp(M) and Amp(M) is aligned, so hopefully Amp(M) can “oversee” the training/creation of M+ in a way that results in M+ being aligned also. Intuitively, M+ shouldn’t be able to fool or deceive Amp(M) because it’s not as smart as Amp(M).
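To make the loop concrete, here is a minimal runnable toy in Python. It only illustrates the amplify-then-distill structure, not the actual IDA proposal: the task (counting the 1-bits of an integer), the one-step decomposition, and the lookup-table "training" are all stand-ins chosen so the example is self-contained.

```python
# Toy illustration of the IDA loop (not the actual proposal: the task,
# the decomposition, and the lookup-table "training" are stand-ins).
# Task: count the 1-bits of an integer. The initial model M only knows
# the answers for 0 and 1.

def amplify(model):
    """Amp(M): answer a question via one step of decomposition plus
    calls to M, using popcount(n) = popcount(n // 2) + (n % 2)."""
    def amplified(n):
        if n <= 1:
            return model[n]
        return model[n // 2] + (n % 2)
    return amplified

def distill(amplified, questions):
    """M+: a cheap lookup table trained to imitate Amp(M) on a batch of
    questions (standing in for supervised learning overseen by Amp(M))."""
    return {q: amplified(q) for q in questions}

model = {0: 0, 1: 1}                     # M: initial model, weakly capable

for r in range(4):
    amp = amplify(model)                 # Amp(M): smarter than M
    curriculum = range(2 ** (r + 2))     # questions Amp(M) can now answer
    model = distill(amp, curriculum)     # M+: imitates Amp(M), replaces M

print(model[25])                         # -> 3, since bin(25) == '0b11001'
```

Each round, the distilled table answers questions that only the amplified system could answer the round before, which is the sense in which capability scales while every answer traces back to something the previous (trusted) system endorsed.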
Also note that the AlphaZero algorithm is an example of IDA:
The amplification step is when the policy / value neural net is used to guide a search over a number of steps in the game tree, resulting in a better estimate of the best move than just using the output of the net directly.
The distillation step is when the policy / value net is trained to match the output of the game tree exploration process.
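Here is a rough sketch of that loop in Python, again just to make the structure concrete. It is not DeepMind's code: 1-pile Nim stands in for the game, a value table stands in for the policy / value net, and a depth-limited lookahead stands in for the tree search.

```python
# Toy "expert iteration" with the same shape as AlphaZero's training
# loop (not DeepMind's code: 1-pile Nim stands in for the game, a value
# table stands in for the policy/value net, and a depth-limited
# lookahead stands in for the real tree search).

MOVES = (1, 2, 3)       # take 1-3 stones; whoever takes the last stone wins

# "Net": V[pile] = estimated win probability for the player to move.
V = {pile: 0.5 for pile in range(22)}    # starts out knowing nothing

def search(pile, depth=3):
    """Amplification: spend extra compute on a shallow game-tree search
    guided by the current net, giving a better value estimate than
    reading V[pile] directly."""
    if pile == 0:
        return 0.0          # the player to move has already lost
    if depth == 0:
        return V[pile]      # fall back on the raw "net" at the leaves
    return max(1.0 - search(pile - m, depth - 1) for m in MOVES if m <= pile)

def distill(lr=0.5):
    """Distillation: train the 'net' to match the search output by
    nudging V[pile] toward the amplified estimate."""
    for pile in range(1, 22):
        V[pile] += lr * (search(pile) - V[pile])

for _ in range(10):         # iterate: amplify, distill, repeat
    distill()

# The distilled table alone now reproduces what only the search could
# work out at first: piles that are multiples of 4 are losing.
print(round(V[4], 2), round(V[8], 2))    # -> values near 0.0
print(round(V[5], 2), round(V[7], 2))    # -> values near 1.0
```

After a few rounds the distilled table alone reproduces facts (multiples of 4 are losing positions) that at the start only the search could work out, which is exactly the distillation-of-amplification dynamic.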
I have incorporated your comments and also acknowledged you. Thanks!