And, as that paper inadvertently demonstrates (among others, including my own A/B testing), most companies manage to not run any of those long-term experiments and do things like overload ads to get short-term revenue boosts at the cost of both user happiness and their own long-term bottom line.
That includes Google: note that at the end of a paper published in 2015, for a company which has been around for a while in the online ad business, let us say, they are shocked to realize they are running way too many ads and can boost revenue by cutting ad load.
Ads are the core of Google’s business and the core of all A/B testing as practiced. Ads are the first, second, third, and last thing any online business will A/B test, and if there’s time left over, maybe something else will get tested. If even Google can fuck that up for so long so badly, what else are they fucking up UI-wise? A fortiori, what else is everyone else online fucking up even worse?
Most companies manage to not run any of those long-term experiments and do things like overload ads to get short-term revenue boosts at the cost of both user happiness and their own long-term bottom line.
The claim was that A/B testing was “not as good a tool for measuring long term changes in behavior” and I’m saying that A/B testing is a very good tool for that purpose. That companies generally don’t do it is, I think, mostly due to a lack of long-term focus, independent of experiments. I’m sure Amazon does it.
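For concreteness, here is a minimal sketch (my own illustration, with made-up names, not anything from the paper) of what makes an experiment “long-term”: the mechanics are the same as any A/B test, but the key ingredient is a stable per-user assignment, so the same people stay in the same arm for months and slow effects like ads blindness have time to show up in the metrics.

```python
import hashlib

def assign_arm(user_id: str, experiment: str, treatment_fraction: float = 0.5) -> str:
    """Deterministically bucket a user so they stay in the same arm
    for the lifetime of the experiment (months, if need be)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "treatment" if bucket < treatment_fraction else "control"

# The same signed-in user always gets the same answer, so engagement can be
# compared between arms after months of exposure rather than days.
print(assign_arm("user-12345", "ad-load-holdback"))
```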
Note that at the end of a paper published in 2015, for a company which has been around for a while in the online ad business, let us say, they are shocked to realize they are running way too many ads and can boost revenue by cutting ad load.
The paper was published in 2015, but describes work on estimating long-term value going back to at least 2007. It sounds like you’re referring to the end of section five, where they say “In 2013 we ran experiments that changed the ad load on mobile devices … This and similar ads blindness studies led to a sequence of launches that decreased the search ad load on Google’s mobile traffic by 50%, resulting in dramatic gains in user experience metrics.” By 2013 they were certainly already taking into account long-term value, even on mobile (which was pretty small until just around 2013). This section isn’t saying “we set the threshold for the number of ads to run too high” but “we were able to use our long-term value measurements to better figure out which ads not to run”. So I don’t think “if even Google can fuck that up for so long so badly” is a good reading of the paper.
Ads are the first, second, third, and last thing any online business will A/B test, and if there’s time left over, maybe something else will get tested.
I work in display ads and I don’t think this is right. Where you see the most A/B testing is in funnels. If you’re selling something, the gains from optimizing the flow from “user arrives on your site” to “user finishes buying the thing” are often enormous, like >10x. Whereas with ads, if you just stick AdSense or something similar on your page, you’re going to be within, say, 60% of where you could be with a super complicated header bidding setup. And if you want to make more money with ads, your time is better spent on negotiating direct deals with advertisers than on A/B testing. I dearly wish I could get publishers to A/B test their ad setups!
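To put rough (entirely made-up) numbers behind that comparison: funnel improvements compound multiplicatively across steps, while gains from tuning the ad stack are capped by how close a default setup already is to optimal. A quick sketch:

```python
# Illustrative arithmetic only; every rate here is invented.
baseline_funnel = [0.30, 0.08, 0.40]  # visit->cart, cart->checkout, checkout->purchase
improved_funnel = [0.55, 0.25, 0.75]  # the same steps after funnel optimization

def conversion(steps):
    """Overall conversion is the product of the per-step rates."""
    rate = 1.0
    for p in steps:
        rate *= p
    return rate

funnel_gain = conversion(improved_funnel) / conversion(baseline_funnel)

# Ads: if a default AdSense-style setup already captures ~60% of achievable
# revenue, the ceiling on gains from A/B testing the ad stack is ~1/0.6.
ads_gain_ceiling = 1 / 0.6

print(f"funnel gain: {funnel_gain:.1f}x")                 # ~10.7x
print(f"ad-stack gain ceiling: {ads_gain_ceiling:.2f}x")  # ~1.67x
```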
The claim was that A/B testing was “not as good a tool for measuring long term changes in behavior” and I’m saying that A/B testing is a very good tool for that purpose.
And the paper you linked showed that it wasn’t being done for most of Google’s history. If Google doesn’t do it, I would be doubtful if anyone, even a peer like Amazon, does. Is it such a good tool if no one uses it?
By 2013 they were certainly already taking into account long-term value, even on mobile (which was pretty small until just around 2013). This section isn’t saying “we set the threshold for the number of ads to run too high” but “we were able to use our long-term value measurements to better figure out which ads not to run”.
Which is just another way of saying that before then they hadn’t used their long-term value measurements to figure out what threshold of ads to run. Whether 2015 or 2013, this is damning. (As are, of course, the other ones I collate, with the exception of Mozilla, who don’t dare make an explosive move like shipping adblockers installed by default, so the VoI to them is minimal.)
The result which would have been exculpatory is if they said, “we ran an extra-special long-term experiment to check we weren’t fucking up anything, and it turns out that, thanks to all our earlier long-term experiments dating back many years which were run on a regular basis as a matter of course, we had already gotten it about right! Phew! We don’t need to worry about it after all. Turns out we hadn’t A/B-tested our way into a user-hostile design by using wrong or short-sighted metrics. Boy it sure would be bad if we had designed things so badly that simply reducing ads could increase revenue so much.” But that is not what they said.
And the paper you linked showed that it wasn’t being done for most of Google’s history.
This is a nitpick, but 2000-2007 (the period between when AdWords launched and when the paper says they started quantitative ad blindness research) is 1⁄3 of Google’s history, not “most”.
I’m also not sure if the experiments could have been run much earlier, because I’m not sure identity was stable enough before users were signing into search pages.
Also, this sort of optimization isn’t that valuable compared to much bigger opportunities for growth they had in the early 2000s.
If Google doesn’t do it, I would be doubtful if anyone, even a peer like Amazon, does.
Why are you saying Google doesn’t do it? I understand arguing about whether Google was doing it at various times, whether they should have prioritized it more highly, etc., but it’s clearly used and I’ve talked to people who work on it.
Would you be interested in betting on whether Amazon has quantified the effects of ad blindness? I think we could probably find an Amazon employee to verify.
Which is just another way of saying that before then they hadn’t used their long-term value measurements to figure out what threshold of ads to run. Whether 2015 or 2013, this is damning.
It’s specifically about mobile, which in 2013 was only about 10% of traffic and much less by monetization. Similar desktop experiments had been run earlier.
But I also think you’re misinterpreting the paper to be about “how many ads should we run” and that those launches simply reduced the number of ads they were running. I’m claiming that the tuning of how many ads to run to maximize long-term value was already pretty good by 2013, but having a better experimental framework allowed them to increase long-term value by figuring out which specific kinds of ads to run or not run. As a rough example (from my head, I haven’t looked at these launches), imagine an advertiser is willing to pay you a lot to run a bad ad that makes people pay less attention to your ads overall. If you turn down your threshold for how many ads to show, this bad ad will still get through. Measuring this kind of negative externality, which varies on a per-ad basis, is really hard, and it’s especially hard if you have to run very long experiments to quantify the effect. One of the powerful tools in the paper is estimating long-term impacts from short-term metrics so you can iterate faster, which makes it easier to evaluate many things, including these kinds of externalities.
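To make the “estimate long-term impacts from short-term metrics” idea concrete, here is a minimal sketch under my own invented assumptions (this is not the paper’s actual model, and the numbers are made up): fit a simple mapping from the short-term readouts of completed long-running experiments to the long-term changes they eventually produced, then use it to score new experiments from their short-term metrics alone.

```python
import numpy as np

# Hypothetical training data from completed long-running experiments.
# Columns: short-term relative change in ad CTR, short-term change in a
# crude "bad ad impression" rate (both invented for illustration).
short_term = np.array([
    [+0.020, +0.010],
    [+0.050, +0.030],
    [-0.010, -0.005],
    [+0.001,  0.000],
])
# The long-term change in user ad engagement each experiment eventually showed.
long_term = np.array([-0.004, -0.012, +0.002, 0.000])

# Least-squares fit of a linear mapping from short-term to long-term effects.
X = np.column_stack([short_term, np.ones(len(short_term))])
coef, *_ = np.linalg.lstsq(X, long_term, rcond=None)

def predict_long_term(readout):
    """Score a new experiment from its short-term readout instead of
    waiting months for the long-term effect to materialize."""
    return float(np.append(readout, 1.0) @ coef)

print(predict_long_term([+0.030, +0.020]))
```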
(As before, speaking only for myself and not for Google)