jefftk comments on Experiments and Consent

jefftk 11 Nov 2019 0:49 UTC
20 points
Giving different results to different people for the same input is unethical.

You’re going to need to give more justification for this. Here are some examples that I think even someone who’s skeptical should be ok with:
- If we both get mystery-flavor dum-dum lollipops they won’t taste the same.
- If we both open packs of Magic cards you might get much better cards than I do.
- If we search Gmail for a phrase we’ll get different results.
- If we search Facebook for “John Smith” we should see different profiles, since FB considers the friend graph in ranking responses.
- If I search Amazon for “piezos” it shows me piezo pickup disks, but if I search it in an incognito window I get “Showing results for piezas”. This is because it has learned something about what sort of products I’m likely to want to buy.
- If we ask for directions on Waze we may get different routings. All the routes it sends people on are reasonable ones (as far as it knows) and you get much better routing than you’d get from a hypothetical Waze that didn’t have all its users as an experimental pool.
You give two arguments:

Even in just the online realm, it can cause major issues for people with learning disabilities or older people who aren’t able to deal with change. If they need help with software, it can be a blocker for them if what they experience is different from what they see in help pages or on other people’s computers.

It sounds like you’re mostly talking about user-interface experiments? Like, if Tumblr shows me different results than it shows you that doesn’t limit your ability to help me, or my ability to use help pages. Even just with UI experiments, your argument proves too much: it says it’s unethical for companies to ever change their UI. Now people who are used to it working one way need all need to learn how to use the new interface. And all the Stack Overflow answers are wrong now. But clearly making changes to your UI is ok!

If either A or B is better for the user, they are getting discriminated against by the random algorithm that chooses which version of the software to show them.

Companies run A/B tests when they don’t know which of A or B is better, and running these tests allows them to make products that are better than if they didn’t run the tests. Giving everyone worse outcomes to make sure everyone always gets identical outcomes would not be an improvement.

Are there other reasons behind your claim?
- Said Achmiz 11 Nov 2019 5:38 UTC
  19 points
  Parent
  Addendum to my other comment:
  
  … running these [A/B] tests allows them to make products that are better than if they didn’t run the tests.
  
  Empirically, as a trend across the industry, this has turned out to be false. “Design by A/B test” has dramatically eroded the quality of UI/UX design over the last 10-15 years.
  
  Giving everyone worse outcomes to make sure everyone always gets identical outcomes would not be an improvement.
  
  On the contrary, it quite often would be an improvement—and a big one. Not only are “worse” outcomes by the metrics usually used in A/B tests often not even actually worse by any measure that users might care about, but the gains from consistency (both synchronic and diachronic) are commonly underestimated (for example—clearly—by you); in fact such gains are massive, and compound in the long run. Inconsistency, on the other hand, has many detrimental knock-on effects (increased centralization and dependence on unaccountable authorities, un-democratization of expertise, increased education and support costs, the creation and maintenance of a self-perpetuating expert class and the power imbalances that result—all of these things are either directly caused, or exacerbated, by the synchronic and diachronic UI inconsistency that is rampant in today’s software).
- Said Achmiz 11 Nov 2019 5:27 UTC
  16 points
  Parent
  Even just with UI experiments, your argument proves too much: it says it’s unethical for companies to ever change their UI. Now people who are used to it working one way need all need to learn how to use the new interface. And all the Stack Overflow answers are wrong now. But clearly making changes to your UI is ok!
  
  One man’s modus tollens is another’s modus ponens. I wouldn’t go so far as to say “ever”, but the frequency of UI changes that is commonplace today, I would say, is indeed unethical. I do not agree that “clearly making changes to your UI is ok”. It may be fine—there may be good reasons to do it^[1]—but as far as I’m concerned, the default is that it’s not fine.
  
  The fact is, “people who are used to it working one way … all need to learn how to use the new interface” is a serious, and seriously underappreciated, problem in today’s UX design practices. Many, many hours of productivity are lost to constant, pointless UI changes; a vast amount of frustration is caused. What, in sum, is the human toll of all of this—this self-indulgent experimentation by UX designers, this constant “innovation” and chasing after novelty? It’s not small; not small by any means.
  
  I say that it is unethical. I say that if we, UX designers, had a stronger sense of professional ethics, then we would not do this, and instead would enshrine “thou shalt not change the UI unless you’re damn sure that it’s necessary and good for all users—existing ones most especially” in our professional codes of conduct.
  
  In short: the argument given in the grandparent proves exactly as much as it should.
  ↩︎
  And they needn’t be terribly dramatic reasons; “we added a feature” is a fine reason to change the UI… just enough to accommodate that feature.
  What links here?
  - Said Achmiz's comment on Experiments and Consent by jefftk (11 Nov 2019 5:38 UTC; 19 points)
  - jefftk 11 Nov 2019 13:41 UTC
    8 points
    Parent
    Changing UIs has costs to users. So does charging for your service. Is charging for your service unethical? Think about the vast amount of frustration caused by people not having enough money, just so the company can shovel even more money onto already overpaid CEOs. (Want to modus again?)
    
    I do think companies should seriously consider the disruption UI changes cause, just like they seriously consider the disruption of price increases, and often it will make sense for the company to put in extra development to save their users’ frustration. For example, for large changes like the ~2011 Gmail redesign you can have a period of offering both UIs with a toggle to switch between them. (And stats on how people use that toggle give you very useful information about how the redesign is working.)
    
    Companies that followed your suggestions would, over the years, look very dated. Their UIs wouldn’t be missing features, exactly, but their features would be clunky, having been patched onto UIs that were designed around an earlier understanding of the problem. As the world changed, and which features were most useful to users changed, the UI would keep emphasizing whatever was originally most important. Users would leave for products offered by new companies that better fit their needs, and the company would especially have a hard time getting new users.
    - Said Achmiz 11 Nov 2019 18:42 UTC
      9 points
      Parent
      
      Companies that followed your suggestions would, over the years, look very dated.
      
      “Dated” is not a problem unless you treat UX design like fashion. UIs don’t rust.
      
      their features would be clunky, having been patched onto UIs that were designed around an earlier understanding of the problem
      
      The “earlier understanding” of many problems in UX design was more correct. Knowledge and understanding in the industry has, in many cases, degenerated, not improved.
      
      As the world changed, and which features were most useful to users changed, the UI would keep emphasizing whatever was originally most important. Users would leave for products offered by new companies that better fit their needs, and the company would especially have a hard time getting new users.
      
      Yes, this is certainly the story that designers, engineers, and managers tell themselves. Sometimes it’s even true. Often it’s a lie, to cover the design-as-fashion dynamic.
      
      Changing UIs has costs to users. So does charging for your service. Is charging for your service unethical? Think about the vast amount of frustration caused by people not having enough money, just so the company can shovel even more money onto already overpaid CEOs. (Want to modus again?)
      
      Charging for your service isn’t unethical—though overcharging certainly might be! If companies didn’t charge for their service, they couldn’t provide it (and in cases where this isn’t true, the ethics of charging should certainly be examined). So, yes, once again.
      
      But that’s not the important point. Consider this thought experiment: how much value, translated into money, does the company gain from constant, unnecessary^[1] UI changes? Does the company even gain anything from this, or only the designers within it? If the company does gain some value from it, how much of this value is merely from not losing in zero-sum signaling/fashion races with other companies in the industry? And, finally, having arrived at a figure—how does this compare with the aggregate value lost by users?
      
      The entire exercise is vastly negative-sum. It is destructive of value on a massive scale. Nothing even remotely like “charging money for products or services” can compare to it. Every CEO in the world can go and buy themselves five additional yachts, right now, and raise prices accordingly, and if in exchange this nonsense of “UX design as fashion” dies forever, I will consider that to be an astoundingly favorable bargain.
      
      ↩︎
      That is, changes not motivated by specific usability flaws, specific feature additions, etc.
      - jefftk 11 Nov 2019 21:02 UTC
        4 points
        Parent
        
        “Dated” is not a problem unless you treat UX design like fashion. UIs don’t rust.
        
        “Dated” is a problem for companies because users care about it in selecting products. Compare:
        
        Original GMail: https://upload.wikimedia.org/wikipedia/en/6/67/Gmail_2004.png
        
        Current GMail: https://upload.wikimedia.org/wikipedia/en/1/1b/Gmail_inbox_in_Japanese.png
        
        The first UI isn’t “rusted”, but users looking at it will have a low impression of it and will prefer competing products with newer UIs. I don’t think fashion is the main motivator here, but it is real and you can’t make it go away just by unilaterally stopping playing. (I mean I can but I’m an individual running a personal website, not a company.)
        
        The “earlier understanding” of many problems in UX design was more correct. Knowledge and understanding in the industry has, in many cases, degenerated, not improved.
        
        How so? I can think of cases where earlier UX was a better fit for experienced users and newer UXes are “dumbed down”, is that what you mean?
        
        The entire exercise is vastly negative-sum. It is destructive of value on a massive scale.
        
        Let’s take a case where all the externalities should be internalized: internal tooling at a well run company. I use many internal UIs in my day-to-day work, and every so often one of them is reworked. There’s not much in the way of fashion here, since it’s internal, but there are still UI changes. The kind of general “let’s redo the UI and stop being stuck in a local maximum” is the main motivation, and I’m generally pretty happy with it.
        
        I don’t think the public-facing version is that different. If there was massive value destruction then users would move to software that changed UI less.
        
        Said Achmiz 11 Nov 2019 21:56 UTC
        7 points
        Parent
        
        users looking at it will have a low impression of it
        
        Mistakenly, of course. This is a well-attested problem, and is fundamental to this entire topic of discussion.
        
        I don’t think fashion is the main motivator here
        
        No, the halo effect is the main motivator.
        
        you can’t make it go away just by unilaterally stopping playing
        
        I never said that you could. (Although, in fact, I will now say that you can do so to a much greater extent than people usually assume, though not, of course, completely.)
        
        The “earlier understanding” of many problems in UX design was more correct. Knowledge and understanding in the industry has, in many cases, degenerated, not improved.
        
        How so? I can think of cases where earlier UX was a better fit for experienced users and newer UXes are “dumbed down”, is that what you mean?
        
        In part. A full treatment of this question is beyond the scope of a tangential comment thread, though indeed the question is worthy of a full treatment. I will have to decline to elaborate for now.
        
        If there was massive value destruction then users would move to software that changed UI less.
        
        In practice this is often impossible. For example, how do I move to a browser with which I can effectively browse every website, but whose UI stays static? I can’t (in large part because of anti-competitive behavior and general shadiness on the part of Google, in part because of other trends).
        
        The fact is that such simplistic, spherical-cow models of user behavior and systemic incentives fail to capture a large number and scope of “Molochian” dynamics in the tech industry (and the world at large).
        
        jefftk 12 Nov 2019 1:46 UTC
        2 points
        Parent
        
        users looking at it will have a low impression of it
        
        Mistakenly, of course. This is a well-attested problem, and is fundamental to this entire topic of discussion.
        
        I’m not sure that this is mistaken: companies that can keep their UI current can probably, in general, make better software. This probably only holds for large companies: since small companies face more of a choice of what to prioritize while large companies that look like they’re from 2005 are more likely to be environments that can’t get anything done.
        
        I’m generally pretty retrogrouch, and do often prefer older interfaces (I live on the command line, code in emacs, etc). But I also recognize that different interfaces work well for different people and as more people start using tech I get farther and farther from the norm.
        
        you can’t make it go away just by unilaterally stopping playing
        
        I never said that you could.
        
        That was how I interpreted your suggestion that UX people start to follow a “change UIs only when functionality demands”. Anyone who tried to do the “responsible” thing would lose out to less responsible folks. Even if you got a large group of UX people to refuse work they considered to be changing UIs for fashion, companies are in a much stronger position since the barrier to entry for UX work is relatively low.
        
        how do I move to a browser with which I can effectively browse every website, but whose UI stays static? I can’t.
        
        The rendering engines of Chrome/Edge/Opera (Blink), Safari (WebKit), and Firefox (Gecko) are all open source and there are many projects that wrap their own UI around a rendering engine. The amount of work is really not that much, especially on mobile (where iOS requires you to take this approach). If this was something that many people cared about it would not be hard for open source projects to take it on, or companies to sell it. That no one is prioritizing a UI-stable browser really is strong evidence that there’s not much demand.
        
        in large part because of anti-competitive behavior and general shadiness on the part of Google
        
        Not sure what you’re referring to here?
        
        Said Achmiz 12 Nov 2019 2:57 UTC
        5 points
        Parent
        
        companies that can keep their UI current can probably, in general, make better software
        
        To the contrary: companies that update their UI to be “current” probably, in general, make worse software (and not only in virtue of the fact that the UI updates often directly make the software worse).
        
        I’m generally pretty retrogrouch, and do often prefer older interfaces (I live on the command line, code in emacs, etc). But I also recognize that different interfaces work well for different people …
        
        Do they? It’s funny; I’ve seen this sort of sentiment quite a few times. It’s always either “well, actually, I like older UIs, but newer UIs work better [in unspecified ways] for some people [but not me]”, or “I prefer newer UIs, because they’re [vague handwaving about ‘modern’, ‘current’, ‘clean’, ‘not outdated’, etc.,]”. Much less frequent, somehow—to the point of being almost totally absent from my experience—are sentiments along the lines of “I prefer modern UIs, for the following specific reasons; they are superior to older UIs, which have the following specific flaws (which modern UIs lack)”.
        
        That was how I interpreted your suggestion that UX people start to follow a “change UIs only when functionality demands”. Anyone who tried to do the “responsible” thing would lose out to less responsible folks. Even if you got a large group of UX people to refuse work they considered to be changing UIs for fashion, companies are in a much stronger position since the barrier to entry for UX work is relatively low.
        
        But note that this objection essentially concedes the point: that the pressure toward “modernization” of UX design is a Molochian race to the bottom.
        
        The rendering engines of Chrome/Edge/Opera (Blink), Safari (WebKit), and Firefox (Gecko) are all open source and there are many projects that wrap their own UI around a rendering engine. The amount of work is really not that much, especially on mobile (where iOS requires you to take this approach).
        
        [emphasis mine]
        
        I have a hard time believing that you are serious, here. I find this to be an absurd claim.
        
        in large part because of anti-competitive behavior and general shadiness on the part of Google
        
        Not sure what you’re referring to here?
        
        Once again, it is difficult for me to believe that you actually don’t know what I’m talking about—you would have to have spent the last five years, at the very least, not paying any attention to developments in web technologies. But if that’s so, then perhaps the inferential distance between us is too great.
        
        jefftk 12 Nov 2019 14:38 UTC
        2 points
        Parent
        
        Much less frequent, somehow—to the point of being almost totally absent from my experience—are sentiments along the lines of “I prefer modern UIs, for the following specific reasons; they are superior to older UIs, which have the following specific flaws (which modern UIs lack)”.
        
        I think maybe what’s going on is that people who are good at talking about what they like generally prefer older approaches? But if you run usability tests, focus groups, A/B tests, etc you see users do better with modern UIs.
        
        But note that this objection essentially concedes the point: that the pressure toward “modernization” of UX design is a Molochian race to the bottom.
        
        I do think there’s a coordination failure here, as there is in any signaling situation. I think it explains less of what’s going on than you do, and I also don’t think getting UX people to agree on a code of ethics that prohibited non-feature-driven UI changes would be useful. (I also can’t tell if that’s a proposal you’re still pushing.)
        
        The amount of work is really not that much
        
        I have a hard time believing that you are serious, here. I find this to be an absurd claim.
        
        To be specific, I’m estimating that the amount of work required to build and maintain a simple and constant UI wrapper around a browser rendering engine is about one full time experienced engineer for two weeks to build and then 10% of their time (usually 0% but occasionally a lot of work when the underlying implementation changes) going forward. The interface between the engine and the UI is pretty clean. For example, have a look at Apple’s documentation for WebView:
        
        A WebView object is intended to support most features you would expect in a web browser except that it doesn’t implement the specific user interface for those features. You are responsible for implementing the user interface objects such as status bars, toolbars, buttons, and text fields. For example, a WebView object manages a back-forward list by default, and has goBack(_:) and goForward(_:) action methods. It is your responsibility to create the buttons that would send theses action messages.
        
        The situation on Android is similar. Hundreds of apps, including many single-developer ones, use WebView to bring a web browser into their app, with the UI fully under their control.
        
        in large part because of anti-competitive behavior and general shadiness on the part of Google
        
        Not sure what you’re referring to here?
        
        Once again, it is difficult for me to believe that you actually don’t know what I’m talking about—you would have to have spent the last five years, at the very least, not paying any attention to developments in web technologies.
        
        I’ve been paying a lot of attention to this, since that’s been the core of what I’ve worked on since 2012: first on mod_pagespeed and now on GPT. When I look back at the last five years of web technology changes the main things I see (not exhaustive, just what I remember) are:
        
        SPDY, QUIC, HTTP/2, HTTP/3, TLS 1.3 (and everything moved to HTTPS post-Snowden)
        Most sites can develop only for evergreen browsers (no dealing with IE8 etc)
        Service workers, web workers
        WebAssembly
        Browsers blocking identity in third-party contexts
        JavaScript modernization: Promises/async/await etc
        
        I’m still not sure what you’re referring to?
        
        (As before: I work at Google, and am commenting only for myself.)
  - ChristianKl 11 Nov 2019 10:09 UTC
    0 points
    Parent
    Empirically, as a trend across the industry, this has turned out to be false. “Design by A/B test” has dramaticallyeroded the quality of UI/UX design over the last 10-15 years.
    At the first glance this seems to me like “everything was better in the past”. It seems to me like a website that’s stuck in how things were done in the past like Wikipedia which doesn’t do any A/B tests loses in usability compared to more modern websites that are highly optimized.
    In the company where I work we don’t have A/B test and plenty of changes are made for reasons of internal company politics and as a result the users still suffer from bad UI changes.
    - Said Achmiz 11 Nov 2019 18:25 UTC
      5 points
      Parent
      
      At the first glance this seems to me like “everything was better in the past”.
      
      How do you get “everything was better in the past” out of what I wrote?
      
      I am saying that one specific category of thing was better in the past. For this to be unbelievable to you, to trigger this sort of response, you must believe that nothing was better in the past—which is surely absurd, yes?
      
      It seems to me like a website that’s stuck in how things were done in the past like Wikipedia which doesn’t do any A/B tests loses in usability compared to more modern websites that are highly optimized.
      
      Wikipedia has considerably superior usability to the majority of modern websites.
      - ChristianKl 12 Nov 2019 13:21 UTC
        14 points
        Parent
        To write a comment on this website I can click on “reply”, then write my text and click “submit”. On Wikipedia I would have to click on “edit” then find the right section to reply to. Once I have found it I have to decide on the right combination of * and : to put in front of my reply. After I wrote my comment I have to sign it by writing ~~~~. After jumping through those hoops I can click on “publish” (a recent change because user research suggested people were confused by “save”).
        Then if I’m lucky my post is published. If I’m unlucky I have to deal with a merge conflict. It’s hard for me to see Wikipedia here as user-friendly.
        This creates a pressure where some discussion about Wiki editing get pushed to Facebook or Telegram groups that are more user-friendly to use because it takes a lot less effort to write a new message.
        When it comes to menus you have a left side menus. You have the menus on the left and right side on the top of the article. Then you have the top menu on the right side. It’s not clear to a new user why “related changes” is somewhere completely different then “history”.
        More importantly the kind of results that A/B testing reveals are often not as obvious but there effects accumulate. The fact that Wikipedia lost editors over the last decade is for me a sign that they weren’t effective at evolving software that people actually want to use to contribute.
      - jefftk 11 Nov 2019 21:03 UTC
        2 points
        Parent
        
        Wikipedia has considerably superior usability to the majority of modern websites.
        
        Wikipedia is generally pretty good, but the “lines run the full width of your monitor on desktop no matter how wide your screen” is terrible.
        
        Said Achmiz 11 Nov 2019 21:45 UTC
        2 points
        Parent
        If that’s the most severe (or one of the most severe) problems with Wikipedia’s UI that you can think of, then this only proves my point. As you say, Wikipedia is generally pretty good—which cannot be said for the overwhelming majority of modern websites, even—especially!—those that (quite correctly and reasonably) conform to the “limit text column width” typographic guideline.
        
        jefftk 12 Nov 2019 1:49 UTC
        2 points
        Parent
        I didn’t introduce Wikipedia as an example of a site with poor UI. I think it’s pretty good aside from, as I said, the line width issue. It’s also in a space that people have a lot of experience with: displaying textual information to people. Wikipedia could likely benefit from some A/B tests to optimize their page load times, but that’s all behind the scenes.
- cousin_it 11 Nov 2019 9:59 UTC
  7 points
  Parent
  Another ethical consideration is that most A/B tests aren’t aimed to help the user, but to improve metrics that matter to the company, like engagement or conversion rate. All the sketchy stuff you see on the web—sign up for your free account, install our mobile app, allow notifications, autoplay videos, social buttons, fixed headers and footers, animated ads—was probably justified by A/B testing at some point.
  - jefftk 11 Nov 2019 14:16 UTC
    5 points
    Parent
    Companies optimize for making money, and while ideally they do that by providing value for people in some situations they’ll do that best by annoying users. The problem here is bad incentives, though, and if you took way A/B testing you’d just see cargo culting instead.
    - cousin_it 11 Nov 2019 17:54 UTC
      9 points
      Parent
      I agree that A/B tests aren’t evil, and are often useful. All I’m saying is, sometimes they give ammo to dark pattern thinking in the minds of people within your company, and reversing that process isn’t easy.
    - Matt Goldenberg 11 Nov 2019 18:39 UTC
      7 points
      Parent
      It’s not just money, but short term profits. A/B testing is an exceptionally good tool for measuring short term profits, but not as good a tool for measuring long term changes in behavior that come as a result of “optimized” design.
      - jefftk 11 Nov 2019 21:06 UTC
        3 points
        Parent
        Anyone who has a long-term view into user identity (FB, email providers, anywhere you log in) can totally do long-term experiments and account for user learning effects. Google published a good paper about this: Focusing on the Long-term: It’s Good for Users and Business (2015)
        
        (Disclosure: I work for Google)
        
        gwern 11 Nov 2019 22:07 UTC
        17 points
        Parent
        And, as that paper inadvertently demonstrates (among others, including my own A/B testing), most companies manage to not run any of those long-term experiments and do things like overload ads to get short-term revenue boosts at the cost of both user happiness and their own long-term bottom line.
        
        That includes Google: note that at the end of a paper published in 2015, for a company which has been around for a while in the online ad business, let us say, they are shocked to realize they are running way too many ads and can boost revenue by cutting ad load.
        
        Ads are the core of Google’s business and the core of all A/B testing as practiced. Ads are the first, second, third, and last thing any online business will A/B test, and if there’s time left over, maybe something else will get tested. If even Google can fuck that up for so long so badly, what else are they fucking up UI-wise? A fortiori, what else is everyone else online fucking up even worse?
        
        jefftk 12 Nov 2019 2:31 UTC
        2 points
        Parent
        
        Most companies manage to not run any of those long-term experiments and do things like overload ads to get short-term revenue boosts at the cost of both user happiness and their own long-term bottom line.
        
        The claim was that A/B testing was “not as good a tool for measuring long term changes in behavior” and I’m saying that A/B testing is a very good tool for that purpose. That companies generally don’t do it I think is mostly a lack of long-term focus, independent of experiments. I’m sure Amazon does it.
        
        Note that at the end of a paper published in 2015, for a company which has been around for a while in the online ad business, let us say, they are shocked to realize they are running way too many ads and can boost revenue by cutting ad load.
        
        The paper was published in 2015, but describes work on estimating long-term value going back to at least 2007. It sounds like you’re referring to the end of section five, where they say “In 2013 we ran experiments that changed the ad load on mobile devices … This and similar ads blindness studies led to a sequence of launches that decreased the search ad load on Google’s mobile traffic by 50%, resulting in dramatic gains in user experience metrics.” By 2013 they were certainly already taking into account long-term value, even on mobile (which was pretty small until just around 2013). This section isn’t saying “we set the threshold for the number of ads to run too high” but “we were able to use our long-term value measurements to better figure out which ads not to run”. So I don’t think “if even Google can fuck that up for so long so badly” is a good reading of the paper.
        
        Ads are the first, second, third, and last thing any online business will A/B test, and if there’s time left over, maybe something else will get tested.
        
        I work in display ads and I don’t think this is right. Where you see the most A/B testing is in funnels. If you’re selling something the gains from optimizing the flow from “user arrives on your site” to “user finishes buying the thing” are often enormous, like >10x. Whereas with ads if you just stick AdSense or something similar on your page you’re going to be within, say, 60% of where you could be with a super complicated header bidding setup. And if you want to make more money with ads your time is better spent on negotiating direct deals with advertisers than on A/B testing. I dearly wish I could get publishers to A/B test their ad setups!
        
        gwern 12 Nov 2019 3:08 UTC
        21 points
        Parent
        
        The claim was that A/B testing was “not as good a tool for measuring long term changes in behavior” and I’m saying that A/B testing is a very good tool for that purpose.
        
        And the paper you linked showed that it wasn’t being done for most of Google’s history. If Google doesn’t do it, I would be doubtful if anyone, even a peer like Amazon, does. Is it such a good tool if no one uses it?
        
        By 2013 they were certainly already taking into account long-term value, even on mobile (which was pretty small until just around 2013). This section isn’t saying “we set the threshold for the number of ads to run too high” but “we were able to use our long-term value measurements to better figure out which ads not to run”.
        
        Which is just another way of saying that before then they hadn’t used their long-term value measurements to figure out what threshold of ads to run before. Whether 2015 or 2013, this is damning. (As are, of course, the other ones I collate, with the exception of Mozilla who don’t dare make an explosive move like shipping adblockers installed by default, so the VoI to them is minimal.)
        
        The result which would have been exculpatory is if they said, “we ran an extra-special long-term experiment to check we weren’t fucking up anything, and it turns out that, thanks to all our earlier long-term experiments dating back many years which were run on a regular basis as a matter of course, we had already gotten it about right! Phew! We don’t need to worry about it after all. Turns out we hadn’t A/B-tested our way into a user-hostile design by using wrong or short-sighted metrics. Boy it sure would be bad if we had designed things so badly that simply reducing ads could increase revenue so much.” But that is not what they said.
        
        jefftk 12 Nov 2019 15:03 UTC
        2 points
        Parent
        
        And the paper you linked showed that it wasn’t being done for most of Google’s history.
        
        This is a nitpick, but 2000-2007 (the period between when AdWords launched and when the paper says they started quantitative ad blindness research) is ¹⁄₃ of Google’s history, not “most”.
        
        I’m also not sure if the experiments could have been run much earlier, because I’m not sure identity was stable enough before users were signing into search pages.
        
        Also, this sort of optimization isn’t that valuable compared to much bigger opportunities for growth they had in the early 2000s.
        
        If Google doesn’t do it, I would be doubtful if anyone, even a peer like Amazon, does.
        
        Why are you saying Google doesn’t do it? I understand arguing about whether Google was doing it at various times, whether they should have prioritized it more highly, etc, but it’s clearly used and I’ve talked to people who work on it.
        
        Would you be interested in betting on whether Amazon has quantified the effects of ad blindness? I think we could probably find an Amazon employee to verify.
        
        Which is just another way of saying that before then they hadn’t used their long-term value measurements to figure out what threshold of ads to run before. Whether 2015 or 2013, this is damning.
        
        It’s specifically about mobile, which in 2013 was only about 10% of traffic and much less by monetization. Similar desktop experiments had been run earlier.
        
        But I also think you’re misinterpreting the paper to be about “how many ads should we run” and that those launches simply reduced the number of ads they were running. I’m claiming that the tuning of how many ads to run to maximize long-term value was already pretty good by 2013, but having a better experimental framework allowed them to increase long-term value by figuring out which specific kinds of ads to run or not run. As a rough example (from my head, I haven’t looked at these launches) imagine an advertiser is willing to pay you a lot to run a bad ad that makes people pay less attention to your ads overall. If you turn down your threshold for how many ads to show, this bad ad will still get through. Measuring this kind of negative externality that varies on a per-ad basis is really hard, and it’s especially hard if you have to run very long experiments to quantify the effect. One of the powerful tools in the paper is estimating long-term impacts from short term metrics so you can iterate faster, which makes it easier to evaluate many things including these kind of externalities.
        
        (As before, speaking only for myself and not for Google)
        
        Matt Goldenberg 12 Nov 2019 16:31 UTC
        4 points
        Parent
        This is really cool, thanks for the link!
- quanticle 11 Nov 2019 6:00 UTC
  5 points
  Parent
  
  Companies run A/B tests when they don’t know which of A or B is better, and running these tests allows them to make products that are better than if they didn’t run the tests.
  
  The question is whether the cost of the test itself (users being confused by new UIs) outweighs the benefit of running the test. In my personal experience, both as a user and as tech-support, the benefits of new UIs are, at best, marginal. The costs, however, are considerable.
  
  The unstated assumption in your assertion is that A/B testing is the only way for companies to get feedback on their UIs. It isn’t. They can do user-testing with focus groups, and I would be willing to wager that they would learn as much from the focus groups as they would from the A/B tests on their production UI. The only reason to prefer A/B tests in production is because it’s cheaper, and the only reason it’s cheaper is because you’ve offloaded the externality of having to learn a new UI onto the user.
  - jefftk 11 Nov 2019 14:14 UTC
    12 points
    Parent
    (Assuming we’re still talking about A/B testing significant changes to UIs on products that a lot of people use, which is a very small part of A/B testing)
    
    The unstated assumption in your assertion is that A/B testing is the only way for companies to get feedback on their UIs. It isn’t.
    
    Wait, I don’t think this. Running lots of tiny tests and dogfooding can both give you early feedback about product changes before rolling them out. You can run extensive focus groups with real users once you have something ready to release. But if you take the results from those tests just launch to 100%, sometimes you’re going to make bad decisions . Real user testing is especially good for catching issues that apply infrequently, affect populations that are hard to bring in for focus groups, or that only come up after a long time using the product.
    
    Here’s an example of how I think these should be approached:
    
    Say eBay was considering a major redesign of their seller UI. They felt like their current UI was near a local maximum, but if they reworked it they could get somewhere much better.
    
    They run mockups by some people who don’t currently sell on eBay, and they like how much easier it is to list products
    
    They build out something fake but interactive and run focus groups, which are also positive.
    
    They implement the new version and make it available under a new URL, and add a link to the old version that says “try the new eBay” (and a link to the new one that says “switch back to the old eBay”).
    
    When people try the new UI and then choose to switch back they’re offered a comment box where they can say why they’re switching. Most people leave it blank, and it’s annoying to triage all the comments, but there are some real bugs and the team fixes them.
    
    At first they just pay attention to the behavior of people who click the link: are they running into errors? Are they more or less likely to abandon listings? This isn’t an A/B test and they don’t have a proper control group because users are self-selected and learning effects are hard, but they can get rough metrics that let them know if there are major issues they didn’t anticipate. Some things come up, they fix them.
    
    They start a controlled experiment where people opening the seller UI for the first time get either the new or old UI, still with the buttons for switching in the upper corner. They use “intention to treat” and compare selling success between the two groups. Some key metrics are worse, they figure out why, they fix them. This experiment starts looking positive.
    
    They start switching a small fraction of existing users over, and again look at how it goes and how many users chose to switch back to the old UI. Not too many switch back, and they ramp the experiment up.
    
    They add a note to the old UI saying that it’s going away and encouraging people to try out the new UI.
    
    They announce a deprecation date for the old UI and ramp up the experiment to move people over. At this point the only people on the old UI are people who’e tried the new UI and switched back.
    
    They put popups in the old UI asking people to say why they’re not switching. They fix issues that come up there.
    
    They turn down the old UI.
    
    It sounds like you’re saying they should skip all the steps after “They implement the new version and make it available under a new URL” and jump right to “They turn down the old UI”?
    - gbear605 12 Nov 2019 19:43 UTC
      1 point
      Parent
      That whole process seems plausibly ethical. The problem is that most companies go straight from “considering a major redesign” to “implement the new version” and then switch half of users over to the new UI and leave half on the old UI. And even with that whole step, I have literally seen disassociative episodes occur because of having a user interface changed (specifically, the Gmail interface update that happened last year). It should be done only with extreme care.
      - jefftk 12 Nov 2019 20:11 UTC
        2 points
        Parent
        
        the Gmail interface update that happened last year
        
        Are you talking about the Inbox deprecation?
        
        gbear605 12 Nov 2019 20:28 UTC
        1 point
        Parent
        No, the one described in https://www.theverge.com/2018/4/12/17227974/google-gmail-design-features-update-2018-redesign that came in April 2018
        jefftk 12 Nov 2019 21:26 UTC
        2 points
        Parent
        That’s not talking about a UI refresh, but about Gmail adding new features:
        
        Introduction of snooze
        Introduction of smart reply
        Offering attachment links in the message list view
        Collapsable sidebar
        
        Is that what you’re talking about or am I still looking at the wrong thing?
        
        gbear605 12 Nov 2019 21:43 UTC
        9 points
        Parent
        That rollout of new features also included a UI refresh making it look “cleaner.”
        See https://www.cultofmac.com/544433/how-to-switch-on-new-gmail-redesign/, this HN post, and https://www.theverge.com/2018/4/25/17277360/gmail-redesign-live-features-google-update which says “The new look, which exhibits a lot of softer forms and pill-shaped buttons, will have to prove itself over time”