Maybe the trawler problem would be mitigated if LessWrong offered a daily XML or plaintext or whatever dump at a different URL and announced it in robots.txt?
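A rough sketch of what that might look like. Note this is speculative: robots.txt has no standard directive for advertising a bulk dump, so the closest standard mechanism is a `Sitemap:` line, plus a comment that only well-behaved crawlers would ever read. The dump URL here is hypothetical.

```
# Hypothetical robots.txt sketch: block page-by-page crawling,
# point bots at a daily bulk dump instead.
User-agent: *
Disallow: /

# Sitemap is the only standard discovery directive; a comment like
# this one is purely advisory and relies on crawler goodwill:
# Full daily dump available at /exports/daily-dump.xml
Sitemap: https://www.lesswrong.com/exports/daily-dump.xml
```

The catch, as the replies below note, is that this only helps against crawlers that respect robots.txt at all.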
Epistemic status: Late-night hot take, noting it down so I don’t forget it. Not endorsed. Asked in the spirit of a question post. I am aware that people may respond both “ehm, we already do that” and “no! we don’t give in to threats!”. I don’t know.
The URLs they crawl are already blocked by our robots.txt, and they are actively sending requests from thousands of different IPs with realistic, randomly sampled user-agents to prevent any algorithmic blocking.
Yeah, this is borderline DDoS behavior, pretty hostile. Maybe a resource-limited queue for first-time requests from unauthenticated users?
I mean, how do we enforce the rate limits when every request comes from a different IP?
Edit: Ah, you mean a whole joint queue. Yeah, that’s not crazy; we were thinking of doing something that prioritizes requests from users who are not first-time visitors, for exactly that reason. I’m a bit wary of it because it could sweep the problem under the rug in a way that has large costs while removing any feedback we’d get about it: anyone who would tell us the site is behaving badly for them is now in the prioritized category, and meanwhile we’d be missing out on growth because new users often have a bad experience.
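A minimal sketch of the joint-queue idea, assuming some session store that can tell returning users from first-timers (the `seen_users` set here is a stand-in for that). The point is that prioritization keys on user identity rather than IP, which is what per-IP rate limits can't do when every request arrives from a fresh IP.

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Serve requests from returning users before first-time requests.

    Hypothetical sketch: tier 0 (returning) always dequeues before
    tier 1 (first-time); within a tier, requests stay FIFO.
    """

    RETURNING, FIRST_TIME = 0, 1  # lower number = served first

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # preserves FIFO order within a tier
        self.seen_users = set()            # stand-in for a real session store

    def enqueue(self, user_id, request):
        tier = self.RETURNING if user_id in self.seen_users else self.FIRST_TIME
        heapq.heappush(self._heap, (tier, next(self._counter), request))
        self.seen_users.add(user_id)

    def dequeue(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Under load, first-time requests sit at the back of the queue, which is exactly the degraded-new-user-experience trade-off described above.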
Yeah, hurting new user experience is a risk for sure.
Maybe just switch over to the group queue system temporarily when a deluge hits?
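The "switch over when a deluge hits" trigger could be as simple as a sliding-window request counter; this is a hypothetical sketch with made-up threshold numbers, not anything the site actually runs.

```python
import time
from collections import deque

class DelugeDetector:
    """Flag when recent request volume crosses a threshold.

    Hypothetical sketch: keeps timestamps of recent requests in a
    sliding window; the queue system would switch on while deluge()
    is True and back off when traffic subsides.
    """

    def __init__(self, threshold_rps=100, window_s=10):
        self.threshold = threshold_rps * window_s  # max requests per window
        self.window_s = window_s
        self.timestamps = deque()

    def record(self, now=None):
        now = time.monotonic() if now is None else now
        self.timestamps.append(now)
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and self.timestamps[0] < now - self.window_s:
            self.timestamps.popleft()

    def deluge(self):
        return len(self.timestamps) >= self.threshold
```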
Wouldn’t surprise me if this got more common over time, such that what is now a deluge will become the new baseline.
I see. Some pretty underhanded trawlers then, to put it mildly.