After a painful evening, I got an A/B test going on my site using Google Website Optimizer*: testing the CSS max-width property (800, 900, 1000, 1200, 1300, & 1400px). I noticed that most sites seem to set it much more narrowly than I did, eg. Readability. I set the ‘conversion’ target to be a 40-second timeout, as a way of measuring ‘are you still reading this?’

Overnight each variation got ~60 visitors. The original 1400px converts at 67.2% ± 11% while the top candidate 1300px converts at 82.3% ± 9.0% (an improvement of 22.4%) with an estimated 92.9% chance of beating the original. This suggests that a switch would materially increase how much time people spend reading my stuff.
(The other widths, currently: 1000px: 71.0% ± 10%; 900px: 68.1% ± 10%; 1200px: 66.7% ± 11%; 800px: 64.2% ± 11%.)
This is pretty cool—I was blind but now can see—yet I can’t help but wonder about the limits. Has anyone else thoroughly A/B-tested their personal sites? At what point do diminishing returns set in?
* I would prefer to use Optimizely or Visual Website Optimizer, but they charge just ludicrous sums: if I wanted to test my 50k monthly visitors, I’d be paying hundreds of dollars a month!
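(For anyone wondering how a “chance to beat original” figure like the 92.9% above can be arrived at: one common approach is to put a Beta posterior over each variation’s conversion rate and sample from both. Google Website Optimizer does not publish its exact formula, so the sketch below is purely illustrative; the visitor counts in it are assumptions back-figured from the quoted rates and ~60 visitors per variation, and it will not reproduce the 92.9% exactly.)

```python
# Minimal sketch of a Bayesian "chance to beat original" estimate: model each
# variation's conversion rate with a Beta(successes + 1, failures + 1)
# posterior and count how often the candidate's sampled rate exceeds the
# original's. The counts below are ASSUMED, back-figured from "~60 visitors"
# and the quoted rates; GWO's real calculation may differ.
import random

def chance_to_beat(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Estimate P(rate_b > rate_a) by Monte Carlo over Beta posteriors."""
    rng = random.Random(seed)
    beats = 0
    for _ in range(draws):
        p_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        p_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        beats += p_b > p_a
    return beats / draws

# Original 1400px: ~67% of ~60 visitors; candidate 1300px: ~82% of ~60.
print(chance_to_beat(conv_a=40, n_a=60, conv_b=49, n_b=60))
```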
Do you know the size of your readers’ windows?
How is the 93% calculated? Does it correct for multiple comparisons?
Given some outside knowledge (that these 6 choices are not unrelated, but come from an ordered space of choices), the result that one value is special and all the others produce identical results is implausible. I predict that it is a fluke.
No, but it can probably be dug out of Google Analytics. I’ll let the experiment finish first.
I’m not sure how exactly it is calculated. On what is apparently an official blog, the author says in a comment: “We do correct for multiple comparisons using the Bonferroni adjustment. We’ve looked into others, but they don’t offer that much more improvement over this conservative approach.”
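For concreteness, here is what a Bonferroni-style correction looks like when five candidate widths are each compared against the original: compute an unadjusted p-value per comparison, then multiply by the number of comparisons (capping at 1). This is only a sketch; the per-variation counts are assumed from the quoted rates and ~60 visitors each, and the plain two-proportion z-test is not necessarily what Google Website Optimizer runs internally.

```python
# Rough sketch of a Bonferroni adjustment over 5 comparisons against the
# original. Counts are ASSUMED from the quoted rates and ~60 visitors per
# variation; GWO's internals may differ.
from math import erf, sqrt

def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p-value for H0: both variations share one conversion rate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return 1 - erf(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))

original = (40, 60)  # 1400px: ~67% of ~60 visitors (assumed counts)
candidates = {"1300px": (49, 60), "1000px": (43, 60), "900px": (41, 60),
              "1200px": (40, 60), "800px": (39, 60)}
m = len(candidates)  # number of comparisons against the original
for width, (x, n) in candidates.items():
    p = two_proportion_p(*original, x, n)
    print(f"{width}: p = {p:.3f}, Bonferroni-adjusted p = {min(1.0, p * m):.3f}")
```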
Yes, I’m finding the result odd. I really did expect some sort of inverted-V result where a medium-sized max-width was “just right”. Unfortunately, with a doubling of the sample size, the ordering remains pretty much the same: 1300px beats everyone, with 900px passing 1200px and 1100px. I’m starting to wonder if maybe there are 2 distinct populations of users—maybe desktop users with wide screens and then smartphones? It doesn’t quite make sense, since the phones should be setting their own width, but...
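If the screen sizes can indeed be dug out of Google Analytics, one way to poke at the two-populations guess would be to split visitors by viewport width and compare conversion within each segment; a minimal sketch with entirely invented segment counts, just to show the shape of the comparison:

```python
# Hypothetical segmentation check: every count below is INVENTED purely for
# illustration; the real numbers would come from Google Analytics segmented
# by screen resolution. If the ordering of widths flips between segments,
# that would support the two-populations explanation.
segments = {
    "wide screens (>=1440px)":  {"1300px": (30, 35), "900px": (18, 30)},
    "narrow screens (<1440px)": {"1300px": (19, 27), "900px": (23, 30)},
}

for segment, variations in segments.items():
    summary = ", ".join(f"{width}: {conv / total:.0%} ({conv}/{total})"
                        for width, (conv, total) in variations.items())
    print(f"{segment}: {summary}")
```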
A bimodal distribution wouldn’t surprise me. What I don’t believe is a spike in the middle of a plain. If you had chosen increments of 200, the 1300 spike would have been completely invisible!