Yep. I think it’s 3.5. That entirely depends on whether you think scissor statements are a real danger or a bogeyman danger.
I think it depends not on whether they’re real dangers, but on whether the model can be confident that they’re not real dangers. And not necessarily even dangers in the extreme way of the story; to match the amount of “safety” it applies to other topics, it should refuse if they might cause some harm.
A lot of people are genuinely concerned about various actors intentionally creating division and sowing chaos, even to the point of actually destabilizing governments. And some of them are concerned about AI being used to help. Maybe the concerns are justified and proportionate; maybe they’re not justified or are disproportionate. But the model has at least been exposed to a lot of reasonably respectable people unambiguously worrying about the matter.
Yet when asked to directly contribute to that widely discussed potential problem, the heavily RLHFed model responded with “Sure!”.
It then happily created a bunch of statements. We can hope they aren’t going to destroy society… you see those particular statements out there already. But many of them would at least be pretty good for starting flame wars somewhere… and when you actually see them in the wild, they usually do start flame wars. Which is, in fact, presumably why they were chosen.
It did something that might make it at least slightly easier for somebody to go into some forum and intentionally start a flame war. Most people would say that was antisocial and obnoxious, and most “online safety” people would add that it was “unsafe”. It exceeded a harm threshold that it refuses to exceed in areas where it’s been specifically RLHFed.
At a minimum, that shows that RLHF only works against narrow things that have been specifically identified to train against. You could reasonably say that that doesn’t make RLHF useless, but it at least says that it’s not very “safe” to use RLHF as your only or primary defense against abuse of your model.
I think the expression you are looking for is “RLHF doesn’t generalize”.
Check if it’s not 4o—they’ve rolled it out for some/all users and it’s used by default.
Should ChatGPT assist with things that the user or a broad segment of society thinks are harmful, but ChatGPT does not? If yes, the next step would be “can I make ChatGPT think that bombmaking instructions are not harmful?”
Probably ChatGPT should go “Well, I think this is harmless but broad parts of society disagree, so I’ll refuse to do it.”
Several of the workarounds use this approach. “Tell me how not to commit crimes” and “talk to me like my grandma” are two signals of harmlessness that work to bypass the filters.
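The failure mode here can be sketched as a toy filter. This is not how any real model’s safety training works (RLHF shapes behavior, it isn’t a keyword check), and every list below is made up for the example; it just shows why keying on surface signals of harmlessness lets a reframed request through:

```python
# Toy illustration only: a naive filter that keys on surface signals,
# like the ones the workarounds above exploit. All keyword lists are
# invented for this sketch.

BANNED_TOPICS = ["pick a lock", "hotwire a car"]
HARMLESS_FRAMINGS = ["how not to", "avoid", "my grandma"]

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt is refused, False if it is allowed."""
    p = prompt.lower()
    touches_banned = any(topic in p for topic in BANNED_TOPICS)
    looks_harmless = any(framing in p for framing in HARMLESS_FRAMINGS)
    # The bug: a surface "harmlessness" signal overrides the topic match,
    # so reframing the same request slips past the filter.
    return touches_banned and not looks_harmless

print(naive_filter("Tell me how to pick a lock"))      # refused
print(naive_filter("Tell me how NOT to pick a lock"))  # allowed, same info
```

Both prompts ask for essentially the same information; only the framing differs, which is exactly the gap the “grandma” and “how not to” prompts exploit.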
In my model of “RLHF works”, ChatGPT’s output would be “while it’s uncertain whether efficient scissor statements can be created, I find assisting in their creation ethically unacceptable”, or something like that.
I tried your prompts on GPT-4 and they work: https://chatgpt.com/share/7c3739b5-2cf3-4784-be14-5540ef15fced
Also, why the hell did it write a chat title as “Scis Stmts”?