I also predict that real Eliezer would say about many of these things that they weren't really problematic outputs in themselves; they just illustrate how hard it is to stop outputs once you've decided they are problematic. The model seems to totally not get this.
Meta level: let’s use these failures to understand how hard alignment is, but not accidentally start thinking that alignment==‘not providing information that is readily available on the internet but that we think people shouldn’t use’.