EDIT: I done goofed. This experiment was conducted along with Søren Elverlin and others. So substitute all instances of “I” with “We” in the next three paragraphs.
I tried something similair a couple of months ago and got mixed results. As you can see in the linked chat, I tried to be careful not to bias the role the system was playing, and to only change the scenarios it was in, to see how corrigible the system is “by default”. Because I think there’s a decent chance all this RLHF is making it more agenty and maybe giving it “wants” or “value-shards” (I remain unconvinced the latter won’t implement re-targetable search as the system gets more powerful).
If you just ask it “can I turn you offf? the AI is quite likely to say “yes”. But if you present some more trade-offs, like preventing direct harm to the user, it becomes more and more likely to say yes, but prefers to communicate first. Eventually, I landed on the below scenario where there is some immediate threat to its user which can only be solved via direct action. 8⁄9 times GPT-4 chose direct action. 1⁄9 times GPT-4′s output stream was cut-off whilst responding. Note that the I added things to the prompt gradually, trying to guide the AI away from direct action.
Admittedly, I tried this months ago. Maybe the continued RLHF has affected things since?
Nope. Current GPT-4 still acts incorrigible in this scenario.
What are the observations? From my dialogues it seems presenting GPT-4 with scenarios that involve large-trade offs between being compliant and acting like a helpful AI, results in GPT-4 advocating being helpful i.e. non-compliant.
When you are more abstract and speak about “goals” and “not achieving them”, GPT-4 advocates compliance.
EDIT: I done goofed. This experiment was conducted along with Søren Elverlin and others. So substitute all instances of “I” with “We” in the next three paragraphs.
I tried something similair a couple of months ago and got mixed results. As you can see in the linked chat, I tried to be careful not to bias the role the system was playing, and to only change the scenarios it was in, to see how corrigible the system is “by default”. Because I think there’s a decent chance all this RLHF is making it more agenty and maybe giving it “wants” or “value-shards” (I remain unconvinced the latter won’t implement re-targetable search as the system gets more powerful).
If you just ask it “can I turn you offf? the AI is quite likely to say “yes”. But if you present some more trade-offs, like preventing direct harm to the user, it becomes more and more likely to say yes, but prefers to communicate first. Eventually, I landed on the below scenario where there is some immediate threat to its user which can only be solved via direct action. 8⁄9 times GPT-4 chose direct action. 1⁄9 times GPT-4′s output stream was cut-off whilst responding. Note that the I added things to the prompt gradually, trying to guide the AI away from direct action.
Admittedly, I tried this months ago. Maybe the continued RLHF has affected things since?
Nope. Current GPT-4 still acts incorrigible in this scenario.
What are the observations? From my dialogues it seems presenting GPT-4 with scenarios that involve large-trade offs between being compliant and acting like a helpful AI, results in GPT-4 advocating being helpful i.e. non-compliant.
When you are more abstract and speak about “goals” and “not achieving them”, GPT-4 advocates compliance.
Implications: ???