I also predict that real Eliezer would say about many of these things that they weren't really problematic outputs in themselves; they just illustrate how hard it is to stop outputs once you've decided they are problematic. The model seems to totally not get this.
Meta level: let’s use these failures to understand how hard alignment is, but not accidentally start thinking that alignment==‘not providing information that is readily available on the internet but that we think people shouldn’t use’.