Seems like this could be addressed by filtering any comments that use evidence or personal examples out of your dataset.
If that’s too intense, filtering responses to remove personal examples and checking their cited sources shouldn’t be too bad? But maybe you’d just end up with a model that tries to subvert the filter, or draws misleading conclusions from sources, instead of actually being helpful…
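For concreteness, here’s a rough sketch of the dataset-filtering idea, assuming crude regex heuristics stand in for a real classifier (the patterns, function names, and sample comments are all just illustrative, not a tested filter):

```python
import re

# Hypothetical heuristics for "cites evidence" -- illustrative only.
CITATION_PATTERN = re.compile(
    r"\bet al\b\.?|\(\d{4}\)|\bdoi:|arxiv\.org",
    re.IGNORECASE,
)

# Hypothetical heuristics for "personal example" -- illustrative only.
PERSONAL_EXAMPLE_PATTERN = re.compile(
    r"\bin my experience\b|\bI once\b|\ba friend of mine\b|\bpersonally\b",
    re.IGNORECASE,
)


def uses_evidence_or_personal_example(comment: str) -> bool:
    """Crude check for citation-like strings or personal anecdotes."""
    return bool(
        CITATION_PATTERN.search(comment)
        or PERSONAL_EXAMPLE_PATTERN.search(comment)
    )


def filter_dataset(comments: list[str]) -> list[str]:
    """Drop comments the heuristic flags; keep the rest for training."""
    return [c for c in comments if not uses_evidence_or_personal_example(c)]


if __name__ == "__main__":
    sample = [
        "This was predicted by Foo et al 1990.",
        "In my experience this fails on edge cases.",
        "The argument stands on its own without citations.",
    ]
    print(filter_dataset(sample))  # keeps only the last comment
```

Obviously a real version would need something smarter than regexes, but the same caveat below applies either way.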
A hack like that would just have other EDT failure modes: instead of confabulating evidence from my dataset or personal examples, it might just confabulate references. “Yes, this was predicted by Foo et al. 1990, and it makes perfect sense.”