how about a robots.txt?
Yeah, that’s a strong option, which is why I went around checking + linking all the robots.txt files for the websites I listed above :)
In my other post I discuss the tradeoffs of the different approaches; one in particular is that it would be somewhat clumsy to implement post-by-post filters via robots.txt, whereas user-agent filtering handles that just fine.
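
To make that concrete, here's a rough sketch of what per-post user-agent filtering can look like, using only the Python standard library. The crawler names and blocked paths are placeholders, not an exhaustive list, and this isn't the exact setup from my post:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Illustrative placeholders: a few known AI-crawler user agents and a
    # per-post deny list. robots.txt would need one Disallow line per post
    # to express the same thing.
    AI_SCRAPER_AGENTS = ("GPTBot", "CCBot", "Google-Extended")
    BLOCKED_POSTS = {"/posts/private-musings", "/posts/unfinished-draft"}

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            ua = self.headers.get("User-Agent", "")
            # Deny only the listed posts, and only to the listed crawlers.
            if self.path in BLOCKED_POSTS and any(bot in ua for bot in AI_SCRAPER_AGENTS):
                self.send_error(403, "Not available to training crawlers")
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"post content here\n")

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), Handler).serve_forever()

In practice you'd do this in your web server or CDN config rather than in application code, but the point is the same: the per-post logic lives in one conditional instead of a sprawling robots.txt.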
Given the ambiguity about whether GitHub trains models on private repos, I wonder if there's demand for someone to host a public GitLab (or similar) instance that forbids training models on hosted repos, and takes appropriate countermeasures against training-data scrapers accessing its public content.