Seems like this could be addressed by filtering out of your dataset any comments that rely on evidence or personal examples.
If that’s too aggressive, filtering the model’s responses to remove personal examples and to check its sources shouldn’t be too bad? Though maybe you’d just end up with a model that tries to subvert the filter or draw misleading conclusions from its sources instead of actually being helpful…
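To make the dataset-level version concrete, here’s a minimal sketch of that kind of filter. Everything here is hypothetical: the keyword lists are crude stand-ins for whatever classifier you’d actually use, and the function names are made up for illustration.

```python
# Crude heuristic filter: drop comments that look like they cite
# evidence or lean on personal anecdotes. In practice you'd likely
# use a trained classifier instead of keyword matching.
EVIDENCE_MARKERS = ("according to", "a study", "source:", "https://")
ANECDOTE_MARKERS = ("in my experience", "i personally", "when i was")

def looks_like_evidence_or_anecdote(comment: str) -> bool:
    text = comment.lower()
    return any(m in text for m in EVIDENCE_MARKERS + ANECDOTE_MARKERS)

def filter_dataset(comments: list[str]) -> list[str]:
    # Keep only comments that don't trip either marker list.
    return [c for c in comments if not looks_like_evidence_or_anecdote(c)]
```

The same predicate could be reused at response time for the lighter-weight variant, which is part of why a model might learn to phrase things just outside whatever the filter catches.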
I think for those cases you’re better off using standard methods (multiple choice, etc.); this technique is only really useful when paired positive/negative data is harder to create (as with writing imitation).