I flat out do not believe them. Even if Llama-2 was unusually good, the idea that you can identify most unsafe requests with only a 0.05% false positive rate is absurd.
Given the quote in the post, this is not really what they claim. They say (bold mine):
However, false refusal is overall rare—approximately 0.05%—on the helpfulness dataset
So on that dataset, I assume it might be true, although “in the wild” it’s not.