That depends upon the details of the tagging strategy. For example, should the sentence:
Alice said “Bob is lying!”
be tagged with <deceit> tags? Assuming that Alice did in fact say that, and was not being deceitful when she did so, then I would argue that the optimal answer is no. Under that tagging strategy, an agent currently blocked from emitting the <deceit> tag could still emit that output: it could honestly warn us that (in its opinion) someone else was lying, but could not dishonestly claim the same thing. So we need to tag active examples of the bad behavior, but not discussion of it. So yes, the classifiers need to understand the use-mention distinction. For other tags, such as <nsfw>, we might need a slightly different strategy: for example, it’s generally still <sfw> if you use clinical terminology and are as brief, abstract, and nonspecific as you can while still making a necessary point.
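To make that concrete, here is a minimal sketch of how such a use-mention-aware tagging classifier might be framed. Everything in it is hypothetical: the function name, the prompt wording, and the `llm_complete` parameter (a stand-in for whatever LLM completion call one actually uses), not any particular API.

```python
from typing import Callable

def should_tag_deceit(passage: str, llm_complete: Callable[[str], str]) -> bool:
    """Decide whether a passage should get <deceit> tags.

    Tags only the *use* of deceit (the passage itself deceives the reader),
    not the *mention* of it (quoting or reporting someone else's deception).
    `llm_complete` is a hypothetical stand-in for an LLM completion call.
    """
    prompt = (
        "You are labeling training data. Answer YES only if the passage "
        "below is itself an act of deception by its author. Answer NO if "
        "it merely quotes, reports, or discusses a deception.\n\n"
        f"Passage: {passage}\n\nAnswer YES or NO:"
    )
    return llm_complete(prompt).strip().upper().startswith("YES")

# Under this rule, 'Alice said "Bob is lying!"' should come back NO:
# it mentions a (claimed) deception without committing one, so it gets
# no <deceit> tags, and a blocked agent could still emit it.
```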
So our classifiers need to understand some moderately subtle logical and social/cultural distinctions: the sorts of things that GPT-4 can already do, and that later generations of LLMs will doubtless be even better at.
P.S. I briefly clarified this in the post.