Yeah, many of the issues are the same:
*RLHF can be jailbroken with prompts, so you can get it to tell you a sexy story or a recipe for methamphetamine. If we ever get to a point where LLMs know truly dangerous things, they’ll tell you those, too.
*Open-source weights are fundamentally insecure, because you can fine-tune out the guardrails. Sexy stories, meth, or whatever.
The good thing about the War on Horny:
*It probably doesn't really matter, so not much harm is done when people get LLMs to write porn.
*It turns out lots of people want to read porn (surprise! who would have guessed?), so there are lots of attackers trying to bypass the guardrails.
*This gives us good advance warning that the guardrails are worthless.