LessWrong downtime 2010-05-11, and other recent outages and instability

Incident report and hosting update

In the lead-up to 2010-05-11 we (Tricycle) were unhappy with repeated short downtime incidents on the Less Wrong (LW) server (serpent). The apparent cause was the paster process hanging during heavy IO. We had scripted an automatic restart of the process whenever this problem was detected, but each incident still caused up to a minute of downtime, and it was obvious that we needed a proper solution. We concluded that IO on serpent was abnormally slow, and that the physical machine at Slicehost that serpent ran on had IO problems (Slicehost was unable to confirm our diagnosis). We requested migration to a new physical machine.
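For the curious, here is a minimal sketch of the sort of watchdog involved, assuming a cron-driven health check against the front page and an init-style restart command; the URL, command, and timeout are placeholders rather than our production values.

    #!/usr/bin/env python
    """Watchdog sketch: restart the paster process if the site stops responding.

    Illustrative only -- the health-check URL, restart command, and timeout
    are hypothetical stand-ins, not the actual production values.
    """
    import subprocess
    import urllib.request

    HEALTH_URL = "http://lesswrong.com/"             # placeholder health-check URL
    RESTART_CMD = ["/etc/init.d/paster", "restart"]  # placeholder restart command
    TIMEOUT_SECONDS = 15                             # treat a slow response as a hang


    def site_is_up() -> bool:
        """Return True if the site answers with a 200 within the timeout."""
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_SECONDS) as resp:
                return resp.status == 200
        except Exception:
            return False


    if __name__ == "__main__":
        # Intended to be run from cron every minute; each hang therefore still
        # costs up to a minute of downtime, which is why this was only a stopgap.
        if not site_is_up():
            subprocess.run(RESTART_CMD, check=False)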

Error 1: We requested this migration at the end of our working day, and didn’t nurse the migration through.

After the migration LW booted properly, but was quickly unstable. Since we didn’t nurse the migration through, we failed to notice this ourselves. Our website monitoring system (nagios) should have notified us of the failure, but it, too, failed. We have a website monitoring system monitoring system (who watches the watchers? This system does; it is itself watched by nagios).

Error 2: Our website monitoring system monitoring system (a cron job running on a separate machine) was only capable of reporting nagios failures by email. It “succeeded” insofar as it sent an email to our sysadmin notifying him that nagios was failing. It clearly failed in that it did not actually notify a human in reasonable time (our sysadmin, very reasonably, doesn’t check his email during meals).
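Conceptually, that cron job was the email-only watchdog sketched below; the process check, addresses, and SMTP host are placeholders, not our real configuration.

    #!/usr/bin/env python
    """Sketch of the email-only nagios watchdog that failed to reach a human.

    Hypothetical reconstruction: the addresses and SMTP host are illustrative.
    """
    import smtplib
    import subprocess
    from email.message import EmailMessage

    SYSADMIN = "sysadmin@example.com"   # placeholder address
    SMTP_HOST = "localhost"


    def nagios_is_running() -> bool:
        """pgrep exits 0 when a matching process exists."""
        return subprocess.run(["pgrep", "-x", "nagios"],
                              stdout=subprocess.DEVNULL).returncode == 0


    def send_email(subject: str, body: str) -> None:
        msg = EmailMessage()
        msg["From"] = "watchdog@example.com"
        msg["To"] = SYSADMIN
        msg["Subject"] = subject
        msg.set_content(body)
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(msg)


    if __name__ == "__main__":
        # The flaw: email is the only escalation channel, so a notification
        # that lands while the recipient is away from their inbox helps no-one.
        if not nagios_is_running():
            send_email("nagios is down",
                       "nagios is not running on the monitoring host.")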

serpent continued to be unstable through the following morning as we worked on diagnosing and fixing the problem. IO performance did not improve on the new physical server.

On 2010-05-17 we migrated the system again, this time to an AWS server, and saw significant improvements in speed and general stability.

Error 3: The new AWS server was missing one of the Python dependencies the signup captcha relies on. We didn’t notice; no-one was able to sign up until davidjr raised an issue in the tracker (#207), which alerted us.

What we have achieved:

LW is now significantly faster and more responsive. It also has much more headroom on its server—even large load spikes should not reduce performance.

What has been done to prevent recurrence of errors:

Error 1: Human error. We won’t do that again. Generally “don’t do that again” isn’t a very good systems improvement… but we really should have known better.

Error 2: The morning after it failed to notify us, we improved our monitoring system monitoring system: it now attempts to restart nagios itself, and sends SMS notifications and emails to two of us if the restart fails.
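The improved behaviour is sketched below with placeholder commands, addresses, and email-to-SMS gateways: restart first, then escalate on two channels to two people.

    #!/usr/bin/env python
    """Sketch of the improved watchdog: try to restart nagios itself, and
    notify two people by SMS and email only if the restart fails.

    Hypothetical reconstruction: the restart command, the addresses, and the
    email-to-SMS gateway domains are illustrative placeholders.
    """
    import smtplib
    import subprocess
    import time
    from email.message import EmailMessage

    # Two humans, each reachable by email and an email-to-SMS gateway (placeholders).
    RECIPIENTS = [
        "sysadmin@example.com", "5550001@sms.example.net",
        "backup@example.com", "5550002@sms.example.net",
    ]
    RESTART_CMD = ["/etc/init.d/nagios", "restart"]  # placeholder init script
    SMTP_HOST = "localhost"


    def nagios_is_running() -> bool:
        """pgrep exits 0 when a matching process exists."""
        return subprocess.run(["pgrep", "-x", "nagios"],
                              stdout=subprocess.DEVNULL).returncode == 0


    def notify(subject: str, body: str) -> None:
        """Send the same short message to every recipient."""
        with smtplib.SMTP(SMTP_HOST) as smtp:
            for address in RECIPIENTS:
                msg = EmailMessage()
                msg["From"] = "watchdog@example.com"
                msg["To"] = address
                msg["Subject"] = subject
                msg.set_content(body)
                smtp.send_message(msg)


    if __name__ == "__main__":
        if not nagios_is_running():
            # First line of defence: restart nagios automatically.
            subprocess.run(RESTART_CMD, check=False)
            time.sleep(10)  # give the restart a moment before re-checking
            if not nagios_is_running():
                notify("nagios restart failed",
                       "nagios is down and could not be restarted automatically.")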

Error 3: We’re in the process of building a manual deploy checklist to catch this failure and other failures we think plausible. We generally prefer automated testing, but development on this project is not currently active enough to justify the investment. We’ll add an active reminder to run that checklist to our deploy script (it will require us to answer “yes, I have run the checklist”, or something similar, before proceeding).
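A minimal sketch of what that active reminder could look like in a Python deploy wrapper; the checklist module list, the prompt wording, and the dependency name are illustrative, and a dependency check like this one is only meaningful if it runs on the target server.

    #!/usr/bin/env python
    """Sketch of the deploy-script gate: refuse to deploy until the operator
    confirms the checklist, plus a quick automated dependency check.

    The module list and prompt wording are illustrative placeholders; to catch
    an Error-3-style problem the import check would need to run on the server
    being deployed to, not on the operator's machine.
    """
    import importlib
    import sys

    # e.g. the imaging library the signup captcha needs (placeholder name)
    CHECKLIST_MODULES = ["PIL"]


    def dependencies_present() -> bool:
        """Quick automated check for dependencies the checklist also covers."""
        for name in CHECKLIST_MODULES:
            try:
                importlib.import_module(name)
            except ImportError:
                print("Missing dependency: %s" % name)
                return False
        return True


    def operator_confirmed_checklist() -> bool:
        """Force the operator to type the confirmation rather than just hit enter."""
        answer = input('Type "yes, I have run the checklist" to continue: ')
        return answer.strip().lower() == "yes, i have run the checklist"


    if __name__ == "__main__":
        if not dependencies_present() or not operator_confirmed_checklist():
            print("Deploy aborted.")
            sys.exit(1)
        print("Proceeding with deploy...")
        # ... the actual deploy steps would follow here ...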

ETA 2010-06-02:

Clearly still some problems. We’re working on them.

ETA 2010-06-09:

New deployment through an AWS elastic load balancer. We expect this to be substantially more stable and, after DNS propagates, faster.