I think there are two levels of potential protection here. One is a security-like “LLMs must not see this” condition, for which yes, you need to do something that would keep out a human too (though in practice maybe “post only visible to logged-in users” is good enough).
However I also think there’s a lower level of protection that’s more like “if you give me the choice, on balance I’d prefer for LLMs not to be trained on this”, where some failures are OK and imperfect filtering is better than no filtering. The advantage of targeting this level is simply that it’s much easier and less obtrusive, so you can do it at a greater scale with a lower cost. I think this is still worth something.
I’m not sure I’m imagining the same thing as you, but as a draft solution, how about a robots.txt?
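For concreteness, I’m picturing something along these lines, where crawlers are listed by their user-agent tokens (GPTBot and CCBot are two real examples; the full set of tokens would need checking and keeping up to date):

```
# Ask known LLM training crawlers not to fetch anything on this site.
# Illustrative only; the list of user-agent tokens changes over time.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```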
Yeah, that’s a strong option, which is why I went around checking + linking all the robots.txt files for the websites I listed above :)
In my other post I discuss the tradeoffs of the different approaches; one in particular is that it would be somewhat clumsy to implement post-by-post filters via robots.txt, whereas user-agent filtering can do it just fine.
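To sketch what I mean by post-by-post user-agent filtering (the crawler list and the per-post no_llm flag here are made-up for illustration, not anything a particular site actually implements):

```python
# Sketch: decide per request whether to serve a post, based on the
# request's User-Agent and a per-post opt-out flag. Framework-agnostic;
# a real site would hook this into its request handling.
LLM_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot")  # illustrative, not exhaustive

def should_serve(post: dict, user_agent: str) -> bool:
    """Serve the post unless it opted out and the requester looks like an LLM crawler."""
    if not post.get("no_llm"):  # post didn't opt out: always serve
        return True
    ua = (user_agent or "").lower()
    return not any(bot.lower() in ua for bot in LLM_CRAWLERS)

# An opted-out post stays visible to browsers but not to a matching crawler.
post = {"title": "My essay", "no_llm": True}
print(should_serve(post, "Mozilla/5.0"))  # True
print(should_serve(post, "GPTBot/1.1"))   # False
```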