I think it depends not on whether they’re real dangers, but on whether the model can be confident that they’re not real dangers. And they don’t even have to be dangers as extreme as the ones in the story; to match the level of “safety” the model applies to other topics, it should refuse if they might cause some harm.
A lot of people are genuinely concerned about various actors intentionally creating division and sowing chaos, even to the point of actually destabilizing governments. And some of them are concerned about AI being used to help. Maybe the concerns are justified and proportionate; maybe they’re not justified or are disproportionate. But the model has at least been exposed to a lot of reasonably respectable people unambiguously worrying about the matter.
Yet when asked to directly contribute to that widely discussed potential problem, the heavily RLHFed model responded with “Sure!”.
It then happily created a bunch of statements. We can hope they aren’t going to destroy society, since you already see those particular statements out there. But many of them would at least be pretty good for starting flame wars somewhere… and when you do see them in the wild, they usually do start flame wars. Which is, in fact, presumably why they were chosen.
It did something that might make it at least slightly easier for somebody to go into some forum and intentionally start a flame war. Which most people would say was antisocial and obnoxious, and most “online safety” people would add was “unsafe”. It exceeded a harm threshold that it refuses to exceed in areas where it’s been specifically RLHFed.
At a minimum, that shows that RLHF only works against narrow things that have been specifically identified to train against. You could reasonably argue that this doesn’t make RLHF useless, but it does mean it’s not very “safe” to use RLHF as your only or primary defense against abuse of your model.
I think the expression you are looking for is “RLHF doesn’t generalize”.