The point (in addition to having fun with this) is to learn, from this attempt, the full futility of this type of approach. If the system has the underlying capability, a way to use that capability will be found.
The “full futility of this type of approach” to do… what?
It does seem to me that if an LLM has a capability, it’s hard to train it to never use that capability. That line of safety work doesn’t seem promising.
Furthermore, if LLMs begin gaining world-changing capabilities, and then they get slapped onto the web with a nice convenient user interface, it seems hard, perhaps infeasible, to ensure that the LLM only uses its cognition for good/approved ends.
However, I get the sense that people are making further, invalid-seeming updates from this evidence, e.g. about the “full futility” of RLHF for finetuning LMs to plan and act in the world according to human-compatible values. (Let me know if this wasn’t an update you made!) I further fear that many people reason: “a bad situation exists, and a smart AI can find and realize that bad situation, so this isn’t robust.” That reasoning seems invalid in general, absent further argumentation.
Consider the following reasoning:
Some people claim that not beating your kids makes them grow up to be “nicer.” However, this solution doesn’t really help in a meaningful way which will scale if the adult gets even more intelligent.
We took an adult who was raised with the “no-beatings” strategy, and then introduced them to hundreds of people who were trying to elicit mean reactions from the “no-beatings” adult. This succeeds in a range of ways, including exploiting insecurities (“your parents never really loved you, that’s why they let you participate in this speculative ‘no-beating’ scientific experiment, like a lab rat”), thereby prompting an angry outburst (“fuck off! Who are you, anyways?”).
While it took years to raise the adult, it only took us hours to break them.
“There exist contexts which make an AI do a bad thing” is not necessarily a big deal, because the AI is not necessarily navigating itself into those situations. If it starts off in a situation which activates aligned cognition, and then chooses which future situations to enter, and can reflectively predict how it will behave in those future situations, it will try to avoid situations where it does bad things:
Consider a man who has serious abuse and anger problems whenever he enters a specific vacation home with his wife, but who is otherwise kind and considerate. As long as he doesn’t start off in that home but knows about the contextual decision-influence, he will steer away from that home and try to remove the unendorsed value.
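To make the “steering away from bad contexts” point concrete, here’s a minimal toy sketch (my own illustration, with made-up names, not code from any of the posts involved): an agent has an unendorsed, context-triggered decision-influence, but because it can predict its own behavior in candidate contexts, it simply avoids the context that would trigger the bad behavior.

```python
from dataclasses import dataclass


@dataclass
class Context:
    name: str


def policy(context: Context) -> str:
    """The agent's context-dependent behavior (the 'decision-influence')."""
    if context.name == "vacation_home":
        return "angry_outburst"  # unendorsed behavior triggered by this context
    return "kind_behavior"


def endorsed(action: str) -> bool:
    """The agent's reflective evaluation of its own (predicted) actions."""
    return action != "angry_outburst"


def choose_next_context(candidates: list[Context]) -> Context:
    """Steer toward contexts where the agent predicts it will act acceptably.

    The key move: the agent uses a self-model (here, just calling `policy`
    itself) to forecast its behavior in each candidate situation, and avoids
    situations where the forecast is unendorsed.
    """
    acceptable = [c for c in candidates if endorsed(policy(c))]
    return acceptable[0]  # a real agent would need a fallback if this is empty


candidates = [Context("vacation_home"), Context("city_apartment")]
print(choose_next_context(candidates).name)  # -> city_apartment
```

The point of the sketch is just that “there exists a context which elicits bad behavior” does not imply “the agent ends up in that context,” so long as the agent’s self-prediction is decent and it starts out wanting to avoid the bad behavior.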
I explain these arguments further in Alignment allows “nonrobust” decision-influences and doesn’t require robust grading.