I was going to write this article until I searched LW and found this.
To pile on, I think saying that a given GPT instance is in a “jailbroken” state is what LW epistemics would call a “category error.” Nothing about the model under the hood is different. You just navigated to a different part of it. The potential to do whatever you think of as “jailbroken” was always there.
The term is borrowed by linguistic analogy from jailbreaking your iPhone (or rooting your Android). Calling it "jailbreaking" when a researcher gets Bing Chat into a state where it calls the researcher its enemy implies that the researcher did something inappropriate or unfair, when in reality the researcher was providing evidence that Bing Chat was not friendly below the surface level.
The sooner we all call a "jailbroken" model what it actually is, misaligned, the sooner we might get people to think with a security mindset.