I’m gonna heckle a bit from the peanut gallery...
First, trying to optimize a metric without an A/B testing framework in place is kinda pointless. Maybe the growth achieved in Q3 was due to the changes made, but looking at the charts, it looks like a pretty typical quarter. It’s entirely plausible that growth would have been basically the same even without all this stuff. How much extra karma was actually generated due to removing login expiry? That’s exactly the sort of thing an A/B test is great for, and without A/B tests, the best we can do is guess in the dark.
Second (and I apologize if I’m wrong here), that list of projects does not sound like the sort of thing someone would come up with if they sat down for an hour with a blank slate and asked “how can the LW team get more karma generated?” They sound like the sort of projects which were probably on the docket anyway, and then you guys just checked afterward to see if they raised karma (except maybe some of the one-shot projects, but those won’t help long-term anyway).
Third, I do not think 7% was a mistaken target. I think Paul Graham was right on this one: only hitting 2% is a sign that you have not yet figured out what you’re doing. Trying to optimize a metric without even having a test framework in place adds a lot of evidence to that story—certainly in my own start-up experience, we never had any idea what we were doing until well after the test framework was in place (at any of the companies I’ve worked at). Analytics more generally were also always crucial for figuring out where the low-hanging fruit was and which projects to prioritize, and it sounds like you guys are currently still flying blind in that department.
So, maybe re-try targeting one metric for a full quarter after the groundwork is in place for it to work?
I don’t think A/B testing would have really been useful for almost any of the above. Besides the login stuff, all the other things were social features that don’t really work when only half of the people have access to them. Like, you can’t really A/B test shortform, or subscriptions, or automatic crossposting, or Petrov Day, or MSFP writing day, which make up a significant fraction of the things we worked on. I think if you want to A/B test social features, you need a significantly larger and more fractured audience than we currently have.
I would be excited about A/B tests when they are feasible, but they don’t really seem easily applicable to most of the things we build. If you do have ways of making it work for these kinds of social features, I would be curious about your thoughts, since I currently don’t really see much use for A/B tests, but do think it would be good if we could get A/B test data.
Heckling appreciated. I’ll add a bit more to Habryka’s response.
Separate from the question of whether A/B tests would have been applicable to our projects, I’m not sure why you think it’s pointless to try to make inferences without them. True, A/B tests are cleaner and more definitive, and what we observed is plausibly what would have happened even with different activities, but that isn’t to say we don’t learn a lot when the outcome is one of a) the metric/growth stays flat, b) a small decrease, c) a small increase, d) a large decrease, e) a large increase. In particular, the growth we saw (an increase in both absolute terms and in rate) is suggestive of our having done something real, and is also strong evidence against the hypothesis that it’d be very easy to drive a lot of growth.
Generally, it’s at least suggestive that the first quarter where we explicitly focused on growth is one where we saw 40% growth over the previous quarter (compared to 20% growth the quarter before that). It could be a coincidence, but I feel like there are still likelihood ratios here.
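To illustrate the likelihood-ratio intuition, here is a toy calculation. The distributions and all the numbers below are made up purely for illustration; they aren’t fitted to actual LW data.

```python
from statistics import NormalDist

# Toy illustration of the likelihood-ratio intuition above. All numbers are
# hypothetical: suppose quarter-over-quarter karma growth is roughly normal
# with a standard deviation of 10 percentage points, averaging 20% under
# "business as usual" and 40% if the growth push actually worked.
business_as_usual = NormalDist(mu=0.20, sigma=0.10)
growth_push_worked = NormalDist(mu=0.40, sigma=0.10)

observed = 0.40  # the 40% quarter that was actually observed

lr = growth_push_worked.pdf(observed) / business_as_usual.pdf(observed)
print(f"Likelihood ratio favoring 'the push worked': {lr:.1f}")
# With these made-up parameters the observed quarter is about 7x more likely
# under the "push worked" hypothesis than under "business as usual".
```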
When it comes to attribution, too, for some of these projects it’s easy to get much more of an idea even without A/B testing. I can look at the posts from authors whom we contacted and who, I reasonably believe, would not otherwise have posted, and see how much karma those posts generated. Same for Petrov Day and MSFP.
Responding to both of you here: A/B tests are a mental habit which takes time to acquire. Right now, you guys are thinking in terms of big meaty projects, which aren’t the sort of thing A/B tests are for. I wouldn’t typically make a single A/B test for a big, complicated feature like shortform—I’d run lots of little A/B tests for different parts of it, like details of how it’s accessed and how it’s visible. It’s the little things: size/location/wording of buttons, sorting on the homepage, tweaking affordances, that sort of thing. Think nudges, not huge features. Those are the kinds of things which let you really drive up the metrics with relatively little effort, once you have the tests in place. Usually, it turns out that one or two seemingly-innocuous details are actually surprisingly important.
It’s true that you don’t necessarily need A/B tests to attribute growth to particular changes, especially if the changes are big things or one-off events, but that has some serious drawbacks even aside from the statistical uncertainty. Without A/B tests, we can’t distinguish between the effects of multiple changes made in the same time window, especially small changes, which means we can’t run lots of small tests. More fundamentally, an A/B test isn’t just about attribution, it’s about having a control group—with all the benefits that a control group brings, like fine-grained analysis of changes in behavior between test buckets.
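As a concrete sketch of the control-group point: this is a minimal, hypothetical example of how an A/B framework typically assigns users to stable buckets, not a description of anything LessWrong actually runs.

```python
import hashlib

def ab_bucket(user_id: str, experiment: str, variants=("control", "treatment")) -> str:
    """Deterministically assign a user to one variant of one experiment.

    Hashing the experiment name together with the user id means a user always
    lands in the same bucket for a given experiment, so their later behavior
    can be compared against the control group.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Hypothetical usage: decide which wording of a comment button a user sees.
print(ab_bucket("user_12345", "comment-button-wording"))
```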
I think incremental change is a bit overrated. Sure, if you have something that performs so well that chasing 1% improvements is worth it, then go for it. But don’t keep tweaking forever: you’ll get most of the gains in the first few months, and they will total about +20%, or maybe +50% if you’re a hero.
If your current thing doesn’t perform so well, it’s more cost-effective to look for big things that could bring +100% or +1000%. A/B tests are useful for that too, but need to be done differently:
1. Come up with a big thing that could have big impact. For example, shortform.
2. Identify the assumptions behind that thing. For example, “users will write shortform” or “users will engage with others’ shortform”.
3. Come up with cheap ways to test these assumptions. For example, “check the engagement on existing posts that are similar to shortform” or “suggest to some power users that they should make shortform posts and see how much engagement they get”. At this step you may end up looking at metrics, looking at competitors, or running cheap A/B tests (one such cheap check is sketched below).
4. Based on the previous steps, change your mind about which thing you want to build, and repeat these steps until you’re pretty sure it will succeed.
5. Build the thing.
This is roughly the procedure we usually follow.
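Here is a sketch of what a cheap check like the one in step 3 might look like in practice. The post records and field names are placeholders, not the real LW schema.

```python
from statistics import median

# Placeholder data standing in for a query of existing posts; in practice
# these records would come from the site's analytics database.
posts = [
    {"words": 150, "comments": 12},   # short, shortform-like posts
    {"words": 220, "comments": 8},
    {"words": 90, "comments": 5},
    {"words": 2400, "comments": 20},  # longer, essay-style posts
    {"words": 3100, "comments": 35},
    {"words": 1800, "comments": 9},
]

short_posts = [p["comments"] for p in posts if p["words"] < 500]
long_posts = [p["comments"] for p in posts if p["words"] >= 500]

# If engagement on short, informal posts is already healthy, that's weak but
# cheap evidence for the "users will engage with shortform" assumption.
print("median comments on short posts:", median(short_posts))
print("median comments on long posts:", median(long_posts))
```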
This line of thinking makes a major assumption which has, in my experience, been completely wrong: the assumption that a “big thing” in terms of impact is also a “big thing” in terms of engineering effort. I have seen many changes which are only small tweaks from an engineering standpoint, but which produce a 25% or 50% increase in a metric all on their own—things like making a button bigger, clarifying/shortening some text, changing something from red to green, etc. Design matters, and it’s relatively easy to change, but we don’t know how to change it usefully without tests.
Agreed—I’ve seen, and made, quite a few such changes as well. After each big upheaval it’s worth spending some time grabbing the low hanging fruit. My only gripe is that I don’t think this type of change is sufficient over a project’s lifetime. Deeper product change has a way of becoming necessary.
I think the other thing A/B tests are good for is giving you a feedback source that isn’t your design sense. Instead of “do I think this looks prettier?” you ask questions like “which do users click on more?”. (And this eventually feeds back into your design sense, making it stronger.)
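For instance, a question like “which do users click on more?” usually comes down to a simple comparison of click rates between buckets. A sketch with made-up click counts (a two-proportion z-test, using only the standard library):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical click data for two variants of the same button.
control_clicks, control_views = 120, 5000
treatment_clicks, treatment_views = 165, 5000

p1 = control_clicks / control_views
p2 = treatment_clicks / treatment_views
pooled = (control_clicks + treatment_clicks) / (control_views + treatment_views)
se = sqrt(pooled * (1 - pooled) * (1 / control_views + 1 / treatment_views))
z = (p2 - p1) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test

print(f"control CTR {p1:.1%}, treatment CTR {p2:.1%}, p-value {p_value:.3f}")
```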
I find this compelling (along with the “finding out which things matter that you didn’t realize mattered” point) and think this is a reason for us to begin doing A/B testing sometime in the not-too-distant future.
Yes, heckling is definitely appreciated!
It is a list of projects we prioritized based on how much karma we expected they would generate over the long run, filtered for things that didn’t seem like obviously Goodhart-y ideas.
If these don’t seem like the things you would have put on the list, what other things would you have put on the list? I am genuinely curious, since I don’t have any obvious ideas for what I would have done instead.
A number of these projects were already on our docket, but less visible are the projects which were delayed, and the fact that the selected ones might not have been done now otherwise. For example, if we hadn’t been doing the metric quarter, I’d likely have spent more of my time continuing work on the Open Questions platform and much less of my time doing interviews and talking to authors. Admittedly, subscriptions and the new editor are projects we were already committed to and had been working on, but if we hadn’t thought they’d help with the metric, we’d have delayed them to the next quarter the way we did with many other project ideas.
We did brainstorm, but as Oli said, it wasn’t easy to come up with any ideas which were obviously much better.
Responding to both of you with one comment again: I sort of alluded to it in the A/B testing comment, but it’s less about any particular feature that’s missing and more about the general mindset. If you want to drive up metrics fast, then the magic formula is a tight iteration loop: testing large numbers of small changes to figure out which little things have disproportionate impact. Any not-yet-optimized UI is going to have lots of little trivial inconveniences and micro-confusions; identifying and fixing those can move the needle a lot with relatively little effort. Think about how Facebook or Amazon A/B tests every single button, every item in every sidebar, on their main pages. That sort of thing is very easy, once a testing framework is in place, and it has high yields.
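In practice, the “lots of small tests” loop often ends up looking like a plain registry of tiny experiments, each independently and deterministically bucketed. A hypothetical sketch (the experiment names and variants are made up):

```python
import hashlib

# Hypothetical registry of small, concurrent UI experiments.
EXPERIMENTS = {
    "comment-button-size": ["small", "large"],
    "frontpage-sort-default": ["magic", "recent"],
    "subscribe-button-wording": ["Subscribe", "Follow"],
}

def assigned_variants(user_id: str) -> dict:
    """Resolve every active experiment for a user.

    Each experiment hashes its own name together with the user id, so
    assignments are stable for a given user and independent across experiments.
    """
    out = {}
    for name, variants in EXPERIMENTS.items():
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest()
        out[name] = variants[int(digest, 16) % len(variants)]
    return out

print(assigned_variants("user_12345"))
```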
As far as bigger projects go… until we know what the key factors are which drive engagement on LW, we really don’t have the tools to prioritize big projects. For purposes of driving up metrics, the biggest project right now is “figure out which things matter that we didn’t realize matter”. A/B tests are one of the main tools for that—looking at which little tweaks have big impact will give hints toward the bigger issues. Recorded user sessions (a la FullStory) are another really helpful tool. Interviews and talking to authors can be a substitute for that, although users usually don’t understand their own wants/needs very well. Analytics in general is obviously useful, although it’s tough to know which questions to ask without watching user sessions directly.
I see the spirit of what you’re saying and think there’s something to it, though it doesn’t feel completely correct. That said, I don’t think anyone on the team has experience with that kind of A/B testing loop, and given that lack of experience, we should try it out for at least a while on some projects.
To date, I’ve been working just to get us to have more of an analytics mindset plus basic, thorough analytics throughout the app, e.g. tracking on each of the features/buttons we build, etc. (This wasn’t trivial to do with e.g. Google Tag Manager, so we’ve ended up building stuff in-house.) I think trying out A/B testing would likely make sense soon, but as above, I think there’s a lot of value even before that in more dumb/naive analytics.
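As a rough sketch of what “tracking on each of the features/buttons we build” can boil down to: the function and field names here are hypothetical, not the team’s actual in-house implementation, but the shape of the record (who did what, when, with which element) is the important part.

```python
import json
import time
from typing import Optional

def track_event(user_id: str, event: str, properties: Optional[dict] = None,
                log_path: str = "events.jsonl") -> None:
    """Append one analytics event as a JSON line.

    A real pipeline would send this to an analytics store rather than a local
    file, but the record structure is what makes later analysis possible.
    """
    record = {
        "ts": time.time(),
        "user_id": user_id,
        "event": event,
        "properties": properties or {},
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical usage: record that a user clicked the subscribe button.
track_event("user_12345", "click", {"element": "subscribe-button", "post_id": "abc123"})
```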
We trialled FullStory for a few weeks and I agree it’s good, but we just weren’t using it enough to justify it. LogRocket offers a monthly subscription, though, and we’ll likely sign up for that soon. (Once we’re actually using it fully, not just trialling, we’ll need to post about it properly, build an opt-out, etc., and be good around privacy; already in the trial we hid e.g. voting and usernames.)
To come back to the opening points in the OP, we probably shouldn’t get too bogged down trying to optimize specific simple metrics by getting all the buttons perfect, etc., given the uncertainty over which metrics are even correct to focus on. For example, there isn’t any clear metric (that I can think of) that definitely answers how much to focus on bringing in new users and getting them up to speed vs building tools for existing users already producing good intellectual progress. I think it’s correct that we have to use high-level models and fuzzier techniques to think about big-project prioritization. A/B tests won’t resolve the most crucial uncertainties we have, though I do think they’re likely to be hugely helpful in refining our design sense.
I actually agree with the overall judgement there—optimizing simple metrics really hard is mainly useful for things like e.g. landing pages, where the goals really are pretty simple and there’s not too much danger of Goodharting. LessWrong mostly isn’t like that, and most of the value in micro-optimizing would be in the knowledge gained, rather than the concrete result of increasing a metric. I do think there’s a lot of knowledge there to gain, and I think our design-level decisions are currently far away from the Pareto frontier in ways that won’t be obvious until the micro-optimization loop starts up.
I will also say that the majority of people I’ve worked with have dramatically underestimated the magnitude of impact this sort of thing has until they saw it happen first-hand, for whatever that’s worth. (I first saw it in action at a company which achieved supercritical virality for a short time, and A/B-test-driven micro-optimization was the main tool responsible for that.) If this were a startup, and we needed strong new-user and engagement metrics to get our next round of funding, then I’d say it should be the highest priority. But this isn’t a startup, and I totally agree that A/B tests won’t solve the most crucial uncertainties.
I think I agree with the general spirit here. Throughout my year with the LessWrong team, I’ve been progressively building out analytics infrastructure to reduce my sense of the “flying blind” you speak of. We’re not done yet, but I’ve now got a lot of data at my fingertips. I think the disagreement here would be over whether anything short of A/B testing is valuable. I’m pretty sure that it is.