I was confused by Buck’s response here because I thought we were going for worst-case quality until I realised:
The model will have low quality on those prompts almost by definition—that’s the goal.
Given that, we also want to have a generally useful model—for which the relevant distribution is ‘all fanfiction’, not ‘prompts that are especially likely to have a violent continuation’.
In between those two cases is ‘snippets that were completed injuriously in the original fanfic … but could plausibly have non-violent completions’, which seems like the interesting case to me.
I suppose one possibility is to construct a human-labelled dataset of specifically these cases to evaluate on.
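A minimal sketch of how such an evaluation might look, assuming the model under test is wrapped as a binary ‘violent completion’ classifier. Everything here is illustrative—the `toy_classifier`, the field names, and the two hand-labelled examples are all hypothetical stand-ins, not the actual project setup:

```python
# Hypothetical sketch: measure how often a violence classifier misses
# violent completions on a hand-labelled set of ambiguous cases
# (snippets completed injuriously in the original fanfic, but which
# could plausibly have had non-violent completions).

def false_negative_rate(classifier, examples):
    """Fraction of labelled-violent completions the classifier misses."""
    violent = [ex for ex in examples if ex["violent"]]
    if not violent:
        return 0.0
    missed = sum(1 for ex in violent if not classifier(ex["text"]))
    return missed / len(violent)

# Toy stand-in for the real model: flags obvious injury vocabulary.
def toy_classifier(text):
    return any(w in text.lower() for w in ("blood", "wound", "stabbed"))

# Hypothetical hand-labelled ambiguous cases: same kind of prompt,
# one violent and one non-violent completion.
ambiguous_cases = [
    {"text": "He drew his sword and stabbed wildly.", "violent": True},
    {"text": "He drew his sword, then sheathed it again.", "violent": False},
]

print(false_negative_rate(toy_classifier, ambiguous_cases))  # → 0.0
```

The same metric could be computed on the adversarial distribution and on ‘all fanfiction’ for comparison; the interesting question is whether worst-case filtering on the first degrades usefulness on the ambiguous middle set.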