First, trying to optimize a metric without an A/B testing framework in place is kinda pointless. Maybe the growth achieved in Q3 was due to the changes made, but looking at the charts, it looks like a pretty typical quarter. It’s entirely plausible that growth would have been basically the same even without all this stuff. How much extra karma was actually generated due to removing login expiry? That’s exactly the sort of thing an A/B test is great for, and without A/B tests, the best we can do is guess in the dark.
I don’t think A/B testing would have really been useful for almost any of the above. Besides the login stuff, all the other things were social features that don’t really work when only half of the people have access to them. Like, you can’t really A/B test shortform, or subscriptions, or automatic crossposting, or Petrov Day, or MSFP writing day, and those cover a significant fraction of the things we worked on. I think if you want to A/B test social features you need a significantly larger and more fractured audience than we currently have.
I would be excited about A/B tests when they are feasible, but they don’t really seem easily applicable to most of the things we build. If you do have ways of making them work for these kinds of social features, I would be curious about your thoughts, since I currently don’t really see much use for A/B tests, but do think it would be good if we could get A/B test data.
Heckling appreciated. I’ll add a bit more to Habryka’s response.
Separate from the question of whether A/B tests would have been applicable to our projects, I’m not sure why you think it’s pointless to try to make inferences without them. True, A/B tests are cleaner and more definitive, and what we observed is plausibly what would have happened even with different activities, but that isn’t to say we don’t learn a lot from which outcome occurs: a) the metric/growth stays flat, b) small decrease, c) small increase, d) large decrease, or e) large increase. In particular, the growth we saw (an increase in both absolute terms and rate) is suggestive of doing something real, and also strong evidence against the hypothesis that it’d be very easy to drive a lot of growth.
Generally, it’s at least suggestive that the first quarter where we explicitly focused on growth is one where we saw 40% growth over the previous quarter (compared to 20% growth the quarter before that). It could be a coincidence, but I feel like there are still likelihood ratios here.
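The likelihood-ratio intuition above can be made concrete with a toy Bayesian update. Every number below is an illustrative assumption, not a measurement from the actual quarter:

```python
# Toy likelihood-ratio calculation for "did the growth work cause the jump?"
# All probabilities here are made-up illustrative assumptions.

# P(observe ~40% quarterly growth | the growth work mattered)
p_obs_given_effect = 0.5
# P(observe ~40% quarterly growth | ordinary quarter, baseline ~20%)
p_obs_given_chance = 0.1

likelihood_ratio = p_obs_given_effect / p_obs_given_chance

# Start from even (1:1) prior odds that the work mattered,
# then multiply odds by the likelihood ratio.
prior_odds = 1.0
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)

print(likelihood_ratio)  # roughly 5x update toward "the work mattered"
print(posterior_prob)    # roughly 0.83 under these toy numbers
```

The point isn’t the specific numbers — it’s that a single quarter’s outcome still shifts the odds, even without a controlled experiment.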
On attribution, too: for some of these projects it’s easy to get a much better idea even without A/B testing. I can look at the posts from authors we contacted — authors I reasonably believe would not have posted otherwise — and see how much karma that generated. Same for Petrov Day and MSFP.
Responding to both of you here: A/B tests are a mental habit which takes time to acquire. Right now, you guys are thinking in terms of big meaty projects, which aren’t the sort of thing A/B tests are for. I wouldn’t typically make a single A/B test for a big, complicated feature like shortform—I’d run lots of little A/B tests for different parts of it, like details of how it’s accessed and how it’s visible. It’s the little things: size/location/wording of buttons, sorting on the homepage, tweaking affordances, that sort of thing. Think nudges, not huge features. Those are the kinds of things which let you really drive up the metrics with relatively little effort, once you have the tests in place. Usually, it turns out that one or two seemingly-innocuous details are actually surprisingly important.
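One practical note on the mechanics of “lots of little A/B tests”: running many experiments at once is straightforward if bucket assignment is deterministic and salted per experiment, so that assignments are independent across experiments. A minimal sketch — the experiment names and 50/50 split are invented for illustration:

```python
import hashlib

def ab_bucket(user_id: str, experiment: str,
              variants=("control", "treatment")) -> str:
    """Deterministically assign a user to a variant for one experiment.

    Salting the hash with the experiment name makes assignments
    independent across experiments, so many small tests can run
    simultaneously without interfering with each other's bucketing.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    index = int(digest, 16) % len(variants)
    return variants[index]

# The same user can land in different buckets for different experiments:
print(ab_bucket("user_42", "bigger-subscribe-button"))
print(ab_bucket("user_42", "frontpage-sort-order"))
```

Because assignment is a pure function of (user, experiment), a user sees a consistent variant across sessions with no extra state to store.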
It’s true that you don’t necessarily need A/B tests to attribute growth to particular changes, especially if the changes are big things or one-off events, but that has some serious drawbacks even aside from the statistical uncertainty. Without A/B tests, we can’t distinguish between the effects of multiple changes made in the same time window, especially small changes, which means we can’t run lots of small tests. More fundamentally, an A/B test isn’t just about attribution, it’s about having a control group—with all the benefits that a control group brings, like fine-grained analysis of changes in behavior between test buckets.
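Once you have a control bucket, comparing it against the test bucket is standard statistics. A minimal sketch of a pooled two-proportion z-test using only the standard library — the click counts below are invented for illustration:

```python
from math import sqrt, erf

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates
    between a control bucket (a) and a treatment bucket (b)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF, Phi(x) = 0.5*(1 + erf(x/sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: 100/1000 control clicks vs 130/1000 treatment clicks
z, p = two_proportion_z(100, 1000, 130, 1000)
```

With these made-up counts the difference comes out significant at the usual 5% level; the fine-grained per-bucket analysis mentioned above is just this comparison repeated over whatever behaviors you log.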
I think incremental change is a bit overrated. Sure, if you have something that performs so well that chasing 1% improvements is worth it, then go for it. But don’t keep tweaking forever: you’ll get most of the gains in the first few months, and they will total about +20%, or maybe +50% if you’re a hero.
If your current thing doesn’t perform so well, it’s more cost-effective to look for big things that could bring +100% or +1000%. A/B tests are useful for that too, but need to be done differently:
1. Come up with a big thing that could have a big impact. For example, shortform.
2. Identify the assumptions behind that thing. For example, “users will write shortform” or “users will engage with others’ shortform”.
3. Come up with cheap ways to test these assumptions. For example, “check the engagement on existing posts that are similar to shortform” or “suggest to some power users that they make shortform posts and see how much engagement they get”. At this step you may end up looking at metrics, looking at competitors, or running cheap A/B tests.
4. Based on the previous steps, change your mind about which thing you want to build, and repeat these steps until you’re pretty sure it will succeed.
5. Build the thing.

This is roughly the procedure we usually follow.
This line of thinking makes a major assumption which has, in my experience, been completely wrong: that a “big thing” in terms of impact is also a “big thing” in terms of engineering effort. I have seen many changes which are only small tweaks from an engineering standpoint but produce a 25% or 50% increase in a metric all on their own — things like making a button bigger, clarifying or shortening some text, or changing something from red to green. Design matters, and it’s relatively easy to change, but we don’t know how to change it usefully without tests.
Agreed — I’ve seen, and made, quite a few such changes as well. After each big upheaval it’s worth spending some time grabbing the low-hanging fruit. My only gripe is that I don’t think this type of change is sufficient over a project’s lifetime. Deeper product change has a way of becoming necessary.
I think the other thing A/B tests are good for is giving you a feedback source that isn’t your design sense. Instead of “do I think this looks prettier?” you ask questions like “which do users click on more?”. (And this eventually feeds back into your design sense, making it stronger.)
I find this compelling (along with the “finding out which things matter that you didn’t realize mattered” point) and think it’s a reason for us to begin doing A/B testing sometime in the not-too-distant future.
Yes, heckling is definitely appreciated!