The claim was that A/B testing was “not as good a tool for measuring long term changes in behavior” and I’m saying that A/B testing is a very good tool for that purpose.
And the paper you linked showed that it wasn’t being done for most of Google’s history. If Google doesn’t do it, I would be doubtful if anyone, even a peer like Amazon, does. Is it such a good tool if no one uses it?
By 2013 they were certainly already taking into account long-term value, even on mobile (which was pretty small until just around 2013). This section isn’t saying “we set the threshold for the number of ads to run too high” but “we were able to use our long-term value measurements to better figure out which ads not to run”.
Which is just another way of saying that before then they hadn’t used their long-term value measurements to figure out what threshold of ads to run. Whether 2015 or 2013, this is damning. (As are, of course, the other ones I collate, with the exception of Mozilla, who don’t dare make an explosive move like shipping adblockers installed by default, so the VoI to them is minimal.)
The result which would have been exculpatory is if they said, “we ran an extra-special long-term experiment to check we weren’t fucking up anything, and it turns out that, thanks to all our earlier long-term experiments dating back many years which were run on a regular basis as a matter of course, we had already gotten it about right! Phew! We don’t need to worry about it after all. Turns out we hadn’t A/B-tested our way into a user-hostile design by using wrong or short-sighted metrics. Boy it sure would be bad if we had designed things so badly that simply reducing ads could increase revenue so much.” But that is not what they said.
And the paper you linked showed that it wasn’t being done for most of Google’s history.
This is a nitpick, but 2000-2007 (the period between when AdWords launched and when the paper says they started quantitative ad blindness research) is 1⁄3 of Google’s history, not “most”.
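Spelling out that fraction (and assuming, as my own inference rather than anything stated in this thread, that the exchange dates to roughly 2019): Google’s history runs about 1998–2019, so

$$\frac{2007 - 2000}{2019 - 1998} = \frac{7}{21} = \frac{1}{3}.$$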
I’m also not sure if the experiments could have been run much earlier, because I’m not sure identity was stable enough before users were signing into search pages.
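To spell out why stable identity matters here: a long-running experiment needs a stable unit of randomization, and assignment is typically a deterministic hash of some identifier. A toy sketch of that idea (mine, not a description of Google’s infrastructure):

```python
# Toy sketch: long-term experiments need a stable unit of randomization.
# Assignment is a deterministic hash of an identifier, so if that identifier
# churns (e.g. a cookie gets cleared), the same person drifts between arms
# and long-run behavioral differences between arms get diluted.
import hashlib

def assign_arm(unit_id: str, experiment: str = "ad_load_holdback") -> str:
    """Deterministically bucket an identifier into control/treatment."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

# A stable, signed-in identity keeps the same person in one arm for months:
assert assign_arm("user-123") == assign_arm("user-123")

# An unstable identity is effectively re-randomized whenever it resets, so a
# months-long treatment effect washes out of the comparison:
print(assign_arm("cookie-abc"), assign_arm("cookie-def"))  # may land in different arms
```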
Also, this sort of optimization isn’t that valuable compared to much bigger opportunities for growth they had in the early 2000s.
If Google doesn’t do it, I would be doubtful if anyone, even a peer like Amazon, does.
Why are you saying Google doesn’t do it? I understand arguing about whether Google was doing it at various times, whether they should have prioritized it more highly, etc., but it’s clearly used and I’ve talked to people who work on it.
Would you be interested in betting on whether Amazon has quantified the effects of ad blindness? I think we could probably find an Amazon employee to verify.
Which is just another way of saying that before then they hadn’t used their long-term value measurements to figure out what threshold of ads to run. Whether 2015 or 2013, this is damning.
It’s specifically about mobile, which in 2013 was only about 10% of traffic and much less by monetization. Similar desktop experiments had been run earlier.
But I also think you’re misinterpreting the paper as being about “how many ads should we run”, with those launches simply reducing the number of ads being run. I’m claiming that the tuning of how many ads to run to maximize long-term value was already pretty good by 2013, but that having a better experimental framework allowed them to increase long-term value by figuring out which specific kinds of ads to run or not run.

As a rough example (from my head, I haven’t looked at these launches): imagine an advertiser is willing to pay you a lot to run a bad ad that makes people pay less attention to your ads overall. If you turn down your threshold for how many ads to show, this bad ad will still get through. Measuring this kind of negative externality, which varies on a per-ad basis, is really hard, and it’s especially hard if you have to run very long experiments to quantify the effect. One of the powerful tools in the paper is estimating long-term impacts from short-term metrics so you can iterate faster, which makes it easier to evaluate many things, including these kinds of externalities.
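To make that concrete, here’s a toy sketch with made-up numbers (my illustration only; not the paper’s model and not how Google’s ad systems actually work): a global “how many ads” knob can’t filter a high-bidding ad with a large negative externality, but a per-ad adjustment can, and that adjustment is only practical if you can estimate the long-term cost quickly from short-term signals.

```python
# Made-up numbers, purely to illustrate the per-ad externality point above;
# this is not the paper's model or Google's actual auction logic.
ads = [
    # bid per impression, and an estimated long-term cost per impression from
    # the ad blindness it induces (e.g. extrapolated from short-term metrics)
    {"name": "good_ad",     "bid": 0.30, "lt_cost": 0.02},
    {"name": "ok_ad",       "bid": 0.20, "lt_cost": 0.05},
    {"name": "bad_loud_ad", "bid": 0.90, "lt_cost": 1.50},  # pays a lot, burns attention
]
slots = 1

# Policy A: only tune *how many* ads to show. Ranking by bid alone, the bad ad
# wins a slot however far you turn the knob down, because it bids the most.
policy_a = sorted(ads, key=lambda a: a["bid"], reverse=True)[:slots]

# Policy B: score each ad by bid minus its estimated long-term cost, and drop
# ads whose externality outweighs their bid. Only practical if you can estimate
# that cost quickly instead of waiting on a years-long experiment per ad type.
scored = sorted(ads, key=lambda a: a["bid"] - a["lt_cost"], reverse=True)
policy_b = [a for a in scored if a["bid"] - a["lt_cost"] > 0][:slots]

print("count threshold only:   ", [a["name"] for a in policy_a])  # ['bad_loud_ad']
print("per-ad long-term adjust:", [a["name"] for a in policy_b])  # ['good_ad']
```

The point of the toy isn’t the numbers; it’s that the filtering decision has to happen per ad, which is why faster estimates of long-term impact matter.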
(As before, speaking only for myself and not for Google)