I think this is a fun idea, but also, I think these explanations are mostly actually pretty bad, and at least my inner Eliezer is screaming at most of these rejected outputs, as well as the reasoning behind them.
I also don’t think it provides any more substantial robustness guarantees than the existing fine-tuning, though I do think if we train the model to be a really accurate Eliezer-simulator, that this approach has more hope (but that’s not the current training objective of either base-GPT3 or the helpful assistant model).
I also predict that real Eliezer would say about many of these things that they were basically not problematic outputs themselves, just represent how hard it is to stop outputs conditioned on having decided they are problematic. The model seems to totally not get this.
Meta level: let’s use these failures to understand how hard alignment is, but not accidentally start thinking that alignment==‘not providing information that is readily available on the internet but that we think people shouldn’t use’.
I think this is a fun idea, but also, I think these explanations are mostly actually pretty bad, and at least my inner Eliezer is screaming at most of these rejected outputs, as well as the reasoning behind them.
I also don’t think it provides any more substantial robustness guarantees than the existing fine-tuning, though I do think if we train the model to be a really accurate Eliezer-simulator, that this approach has more hope (but that’s not the current training objective of either base-GPT3 or the helpful assistant model).
I also predict that real Eliezer would say about many of these things that they were basically not problematic outputs themselves, just represent how hard it is to stop outputs conditioned on having decided they are problematic. The model seems to totally not get this.
Meta level: let’s use these failures to understand how hard alignment is, but not accidentally start thinking that alignment==‘not providing information that is readily available on the internet but that we think people shouldn’t use’.