We know that Twitter was in multiple datacenters, including their own datacenter, plus Google Cloud, plus AWS. We know that they were trying to get out of these contracts, possibly using default as a negotiating tactic. We know that their technical-debt level was extraordinarily bad. After joking that they probably had a history of their engineers being spies who obfuscated things in order to make it easier to hide their sponsors’ backdoors, I thought about it a bit more and decided that was probably literally true. They were (and probably still are) using orders of magnitude more computing resources than running a service like Twitter ought to take, if it were well engineered. And we know that they started having capacity problems, with timing that seems to line up suspiciously with what we might infer is a monthly billing cycle.
But there are a bunch of very different interpretations of this, which we can’t easily distinguish:
They defaulted on their bill with GCP or AWS, had a bunch of servers shut off, and discovered they no longer had sufficient capacity;
They declined to renew or scaled back their expenditure with GCP or AWS, thinking that their remaining compute resources were adequate, but they weren’t;
They declined to renew or scaled back their expenditure with GCP or AWS, and the remaining capacity was adequate, but they had problems with the migration that would be easier to fix given extra capacity-headroom;
They’re using the same servers as before, and had problems with some crawler/scraper changing its activity patterns.
One thing I can say, from running LW, is that from a capacity perspective crawlers are a much bigger issue for websites than you’d naively expect. (LessWrong does have per-IP-address rate limits; they’re just high enough that you won’t ever hit them under normal usage.) So even if there was a capacity reduction related to their supply of hardware, it may still be the case that most of their capacity was going to scrapers, and that it made sense to try to limit the scrapers as a way of regaining capacity. It seems fairly likely that the rate-limiting option was set up in advance as a quick-response option for any sort of capacity issue (including capacity issues created by things like a developer accidentally deploying slow code, or surges in usage).
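To make the rate-limiting idea concrete, here’s a minimal sketch of a per-IP sliding-window limiter of the kind described; the class name, threshold, and window size are illustrative assumptions, not anything LessWrong or Twitter actually runs.

```python
import time
from collections import defaultdict, deque


class PerIPRateLimiter:
    """Sliding-window rate limiter keyed by client IP."""

    def __init__(self, max_requests: int = 600, window_seconds: int = 60):
        # The default threshold is deliberately high, so ordinary browsing
        # never trips it, but it can be turned down quickly during a
        # capacity incident.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str) -> bool:
        now = time.monotonic()
        hits = self._hits[ip]
        # Drop timestamps that have fallen outside the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: serve a 429 instead of doing real work
        hits.append(now)
        return True


limiter = PerIPRateLimiter()
if not limiter.allow("203.0.113.7"):
    print("429 Too Many Requests")
```

The appeal as an emergency lever is that nothing else about the site has to change: operators just lower the threshold until load is back under control.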
The main problem with crawlers is that their usage patterns don’t match those of regular users, and most optimization effort is focused on the usage patterns of real users, so bots sometimes wind up using the site in ways that consume orders of magnitude more compute per request than a regular user would. And some of these bots have been through many iterations of detection and counter-detection, and are routing their requests through residential-IP botnets, with fake user-agent strings trying to approximate real web browsers.
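Since user-agent and IP checks fail against bots like that, one option is to look at request patterns instead. This is a hedged sketch under made-up assumptions (the endpoints and cost numbers are invented for illustration): flag clients whose average compute cost per request is far above what a real user generates.

```python
from collections import defaultdict

# Rough relative compute cost per endpoint: cached timeline hits are cheap,
# search and deep archive pagination are not. Numbers invented for illustration.
ENDPOINT_COST = {
    "/home_timeline": 1,
    "/tweet": 1,
    "/search": 25,
    "/user_archive_page": 40,  # deep pagination, usually a cache miss
}


def flag_expensive_clients(request_log, cost_threshold=5.0):
    """Return clients whose average cost per request is far above a real
    user's, regardless of what user-agent or IP they present.

    request_log: iterable of (client_id, endpoint) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # client -> [total_cost, request_count]
    for client, endpoint in request_log:
        totals[client][0] += ENDPOINT_COST.get(endpoint, 1)
        totals[client][1] += 1
    return {
        client: total / count
        for client, (total, count) in totals.items()
        if total / count > cost_threshold
    }


log = [("bot-1", "/user_archive_page")] * 50 + [("user-1", "/home_timeline")] * 50
print(flag_expensive_clients(log))  # {'bot-1': 40.0}
```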
As for the shadowbanning thing—the real bug was probably a bit more subtle than the tweet-length description, but the bug itself is not surprising, and given the high-technical-debt codebase, probably not nearly as stupid as it sounds. Or rather: the effect may have been that stupid, but the code itself probably didn’t look, on cursory inspection, like it was that bad. I.e., I would assign pretty high probability to that code containing an attempt to normalize for visibility that didn’t work correctly, or an uncompleted todo item for a visibility-correction score. A codebase like Twitter’s is going to have bugs like this; they can only be discovered by skilled programmers doing forensic investigations, and executives will only know about them during the narrow time window between when they’re discovered and when they’re fixed.
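As a purely hypothetical illustration (not Twitter’s actual code), a “normalize for visibility” bug could look about this innocuous: the intent is reasonable, the placeholder is flagged with a TODO, and the effect is that very-high-follower accounts get scored near zero.

```python
def visibility_correction(author_followers: int) -> float:
    # TODO: tune this against real impression data.
    # Intended to keep huge accounts from drowning out everyone else,
    # but this placeholder over-corrects instead of gently damping,
    # so very-high-follower accounts end up scored near zero.
    return 1.0 / max(author_followers, 1)


def rank_score(engagement: float, author_followers: int) -> float:
    return engagement * visibility_correction(author_followers)


# A tweet with 10x the engagement from a 100M-follower account still loses
# badly to a modest tweet from a 10k-follower account:
print(rank_score(engagement=10_000, author_followers=100_000_000))  # 0.0001
print(rank_score(engagement=1_000, author_followers=10_000))        # 0.1
```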
> The main problem with crawlers is that their usage patterns don’t match those of regular users, and most optimization effort is focused on the usage patterns of real users, so bots sometimes wind up using the site in ways that consume orders of magnitude more compute per request than a regular user would.
And Twitter has recently destroyed its API, I think? Which perhaps has the effect of de-optimizing the usage patterns of bots.
> And some of these bots have been through many iterations of detection and counter-detection, and are routing their requests through residential-IP botnets, with fake user-agent strings trying to approximate real web browsers.
As someone who has done scraping a few times, I can confirm that it’s trivial to circumvent protections against it, even for a novice programmer. In most cases, it’s literally less than 10 minutes of googling and trial & error.
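For illustration, the kind of ten-minute workaround I mean is usually just spoofing a browser User-Agent and pacing requests, e.g. with Python’s requests library. The URL and headers here are placeholders.

```python
import random
import time

import requests

# Pretend to be an ordinary browser; many naive anti-scraping checks look
# no further than the User-Agent header.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch(url: str) -> str:
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    # Jittered delay so the request timing doesn't look machine-regular.
    time.sleep(random.uniform(2, 5))
    return resp.text


html = fetch("https://example.com/some-public-page")
```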
And for a major AI / web-search company, it could be a routine task, with teams of dedicated professionals working on it.