LessWrong downtime 2010-05-11, and other recent outages and instability

Incident report and hosting update

In the lead-up to 2010-05-11 we (Tricycle) were unhappy with repeated short downtime incidents on the Less Wrong (LW) server (serpent). The apparent cause was the paster process hanging during heavy IO. We had scripted an automatic restart of the process whenever this problem was detected, but each incident still caused up to a minute of downtime, and it was obvious that we needed a proper solution. We concluded that IO on serpent was abnormally slow, and that the physical machine at Slicehost that serpent ran on had IO problems (Slicehost was unable to confirm our diagnosis). We requested migration to a new physical machine.
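For the curious, here is a minimal sketch of the sort of watchdog involved, assuming a cron-driven health check against the front page and an init-style restart command; the URL, command, and timeout are placeholders rather than our production values.

    #!/usr/bin/env python
    """Watchdog sketch: restart the paster process if the site stops responding.

    Illustrative only -- the health-check URL, restart command, and timeout
    are hypothetical stand-ins, not the actual production values.
    """
    import subprocess
    import urllib.request

    HEALTH_URL = "http://lesswrong.com/"             # placeholder health-check URL
    RESTART_CMD = ["/etc/init.d/paster", "restart"]  # placeholder restart command
    TIMEOUT_SECONDS = 15                             # treat a slow response as a hang


    def site_is_up() -> bool:
        """Return True if the site answers with a 200 within the timeout."""
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_SECONDS) as resp:
                return resp.status == 200
        except Exception:
            return False


    if __name__ == "__main__":
        # Intended to be run from cron every minute; each hang therefore still
        # costs up to a minute of downtime, which is why this was only a stopgap.
        if not site_is_up():
            subprocess.run(RESTART_CMD, check=False)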

Error 1: We requested this migration at the end of our working day, and didn’t nurse the migration through.

After the migration LW booted properly, but was quickly unstable. Since we didn’t nurse the migration through, we failed to notice this ourselves. Our website monitoring system (nagios) should have notified us of the failure, but it, too, failed. We have a website monitoring system monitoring system (who watches the watchers? This system does; it is itself watched by nagios).

Error 2: Our website monitoring system monitoring system (a cron job running on a separate machine) was only capable of reporting nagios failures by email. It “succeeded” insofar as it sent an email to our sysadmin notifying him that nagios was failing. It clearly failed in that it did not actually notify a human in reasonable time (our sysadmin, very reasonably, doesn’t check his email during meals).
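Conceptually, that cron job was the email-only watchdog sketched below; the process check, addresses, and SMTP host are placeholders, not our real configuration.

    #!/usr/bin/env python
    """Sketch of the email-only nagios watchdog that failed to reach a human.

    Hypothetical reconstruction: the addresses and SMTP host are illustrative.
    """
    import smtplib
    import subprocess
    from email.message import EmailMessage

    SYSADMIN = "sysadmin@example.com"   # placeholder address
    SMTP_HOST = "localhost"


    def nagios_is_running() -> bool:
        """pgrep exits 0 when a matching process exists."""
        return subprocess.run(["pgrep", "-x", "nagios"],
                              stdout=subprocess.DEVNULL).returncode == 0


    def send_email(subject: str, body: str) -> None:
        msg = EmailMessage()
        msg["From"] = "watchdog@example.com"
        msg["To"] = SYSADMIN
        msg["Subject"] = subject
        msg.set_content(body)
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(msg)


    if __name__ == "__main__":
        # The flaw: email is the only escalation channel, so a notification
        # that lands while the recipient is away from their inbox helps no-one.
        if not nagios_is_running():
            send_email("nagios is down",
                       "nagios is not running on the monitoring host.")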

serpent continued to be unstable through the following morning as we worked on diagnosing and fixing the problem. IO performance did not improve on the new physical server.

On 2010-05-17 we migrated the system again, this time to an AWS server, and saw significant improvements in speed and general stability.

Error 3: The new AWS server was missing one of the Python dependencies the signup captcha relies on. We didn’t notice; no-one was able to sign up until davidjr raised an issue in the tracker (#207), which alerted us.

What we have achieved:

LW is now significantly faster and more responsive. It also has much more headroom on its server—even large load spikes should not reduce performance.

What has been done to prevent recurrence of errors:

Error 1: Human error. We won’t do that again. Generally “don’t do that again” isn’t a very good systems improvement… but we really should have known better.

Error 2: The morning after it failed to notify us, we improved our monitoring system monitoring system: it now attempts to restart nagios itself, and sends SMS notifications and emails to two of us if the restart fails.
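The improved behaviour is sketched below with placeholder commands, addresses, and email-to-SMS gateways: restart first, then escalate on two channels to two people.

    #!/usr/bin/env python
    """Sketch of the improved watchdog: try to restart nagios itself, and
    notify two people by SMS and email only if the restart fails.

    Hypothetical reconstruction: the restart command, the addresses, and the
    email-to-SMS gateway domains are illustrative placeholders.
    """
    import smtplib
    import subprocess
    import time
    from email.message import EmailMessage

    # Two humans, each reachable by email and an email-to-SMS gateway (placeholders).
    RECIPIENTS = [
        "sysadmin@example.com", "5550001@sms.example.net",
        "backup@example.com", "5550002@sms.example.net",
    ]
    RESTART_CMD = ["/etc/init.d/nagios", "restart"]  # placeholder init script
    SMTP_HOST = "localhost"


    def nagios_is_running() -> bool:
        """pgrep exits 0 when a matching process exists."""
        return subprocess.run(["pgrep", "-x", "nagios"],
                              stdout=subprocess.DEVNULL).returncode == 0


    def notify(subject: str, body: str) -> None:
        """Send the same short message to every recipient."""
        with smtplib.SMTP(SMTP_HOST) as smtp:
            for address in RECIPIENTS:
                msg = EmailMessage()
                msg["From"] = "watchdog@example.com"
                msg["To"] = address
                msg["Subject"] = subject
                msg.set_content(body)
                smtp.send_message(msg)


    if __name__ == "__main__":
        if not nagios_is_running():
            # First line of defence: restart nagios automatically.
            subprocess.run(RESTART_CMD, check=False)
            time.sleep(10)  # give the restart a moment before re-checking
            if not nagios_is_running():
                notify("nagios restart failed",
                       "nagios is down and could not be restarted automatically.")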

Error 3: We’re in the process of building a manual deploy checklist to catch this failure and other failures we think plausible. We generally prefer automated testing, but development on this project is not currently active enough to justify the investment. We’ll add an active reminder to run that checklist to our deploy script (it will require us to answer “yes, I have run the checklist”, or something similar, before proceeding).
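A minimal sketch of what that active reminder could look like in a Python deploy wrapper; the checklist module list, the prompt wording, and the dependency name are illustrative, and a dependency check like this one is only meaningful if it runs on the target server.

    #!/usr/bin/env python
    """Sketch of the deploy-script gate: refuse to deploy until the operator
    confirms the checklist, plus a quick automated dependency check.

    The module list and prompt wording are illustrative placeholders; to catch
    an Error-3-style problem the import check would need to run on the server
    being deployed to, not on the operator's machine.
    """
    import importlib
    import sys

    # e.g. the imaging library the signup captcha needs (placeholder name)
    CHECKLIST_MODULES = ["PIL"]


    def dependencies_present() -> bool:
        """Quick automated check for dependencies the checklist also covers."""
        for name in CHECKLIST_MODULES:
            try:
                importlib.import_module(name)
            except ImportError:
                print("Missing dependency: %s" % name)
                return False
        return True


    def operator_confirmed_checklist() -> bool:
        """Force the operator to type the confirmation rather than just hit enter."""
        answer = input('Type "yes, I have run the checklist" to continue: ')
        return answer.strip().lower() == "yes, i have run the checklist"


    if __name__ == "__main__":
        if not dependencies_present() or not operator_confirmed_checklist():
            print("Deploy aborted.")
            sys.exit(1)
        print("Proceeding with deploy...")
        # ... the actual deploy steps would follow here ...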

ETA 2010-06-02:

Clearly still some problems. We’re working on them.

ETA 2010-06-09:

New deployment through an AWS elastic load balancer. We expect this to be substantially more stable and, after DNS propagates, faster.