Companies run A/B tests when they don’t know whether A or B is better, and running these tests lets them build better products than they would without testing.
The question is whether the cost of the test itself (users being confused by new UIs) outweighs the benefit of running the test. In my personal experience, both as a user and as tech-support, the benefits of new UIs are, at best, marginal. The costs, however, are considerable.
The unstated assumption in your assertion is that A/B testing is the only way for companies to get feedback on their UIs. It isn’t. They can do user-testing with focus groups, and I would be willing to wager that they would learn as much from the focus groups as they would from the A/B tests on their production UI. The only reason to prefer A/B tests in production is that they’re cheaper, and the only reason they’re cheaper is that you’ve offloaded the externality of having to learn a new UI onto the user.
(Assuming we’re still talking about A/B testing significant changes to UIs on products that a lot of people use, which is a very small part of A/B testing)
The unstated assumption in your assertion is that A/B testing is the only way for companies to get feedback on their UIs. It isn’t.
Wait, I don’t think this. Running lots of tiny tests and dogfooding can both give you early feedback about product changes before rolling them out. You can run extensive focus groups with real users once you have something ready to release. But if you take the results from those tests and just launch to 100%, sometimes you’re going to make bad decisions. Real user testing is especially good for catching issues that apply infrequently, that affect populations that are hard to bring in for focus groups, or that only come up after a long time using the product.
Here’s an example of how I think these kinds of changes should be approached:
Say eBay was considering a major redesign of their seller UI. They felt like their current UI was near a local maximum, but if they reworked it they could get somewhere much better.
They run mockups by some people who don’t currently sell on eBay, and those people like how much easier it is to list products.
They build out something fake but interactive and run focus groups, which are also positive.
They implement the new version and make it available under a new URL, and add a link on the old version that says “try the new eBay” (and a link on the new one that says “switch back to the old eBay”).
When people try the new UI and then choose to switch back they’re offered a comment box where they can say why they’re switching. Most people leave it blank, and it’s annoying to triage all the comments, but there are some real bugs and the team fixes them.
At first they just pay attention to the behavior of people who click the link: are they running into errors? Are they more or less likely to abandon listings? This isn’t an A/B test and they don’t have a proper control group, because users are self-selected and learning effects are hard to account for, but they can get rough metrics that let them know if there are major issues they didn’t anticipate. Some things come up, they fix them.
They start a controlled experiment where people opening the seller UI for the first time get either the new or old UI, still with the buttons for switching in the upper corner. They use “intention to treat” and compare selling success between the two groups (there’s a rough sketch of what this could look like after these steps). Some key metrics are worse, they figure out why, they fix them. This experiment starts looking positive.
They start switching a small fraction of existing users over, and again look at how it goes and how many users choose to switch back to the old UI. Not too many switch back, and they ramp the experiment up.
They add a note to the old UI saying that it’s going away and encouraging people to try out the new UI.
They announce a deprecation date for the old UI and ramp up the experiment to move people over. At this point the only people on the old UI are people who’ve tried the new UI and switched back.
They put popups in the old UI asking people to say why they’re not switching. They fix issues that come up there.
They turn down the old UI.
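To make the assignment and “intention to treat” analysis steps concrete, here’s a minimal sketch in Python of how they could work. Everything in it is an illustrative assumption: the function names, the 10% ramp, and the listing-completion metric are made up for this example, not anything eBay actually does.

```python
# Illustrative sketch only: assumed names (assign_arm, listing_completion_rate)
# and a made-up 10% ramp, not a real eBay system.
import hashlib

RAMP_PERCENT = 10  # share of first-time sellers assigned to the new UI

def assign_arm(user_id: str, experiment: str = "seller-ui-redesign") -> str:
    """Deterministically bucket a user so they get the same arm on every visit."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "new_ui" if int(digest, 16) % 100 < RAMP_PERCENT else "old_ui"

def listing_completion_rate(listings: list[dict], arm: str) -> float:
    """Intention-to-treat: group listings by the arm the seller was *assigned*,
    even if they later clicked "switch back to the old eBay"."""
    assigned = [l for l in listings if l["assigned_arm"] == arm]
    if not assigned:
        return float("nan")
    return sum(1 for l in assigned if l["completed"]) / len(assigned)

# Toy usage: log each listing attempt with the seller's assigned arm,
# then compare completion rates between arms.
listings = [
    {"assigned_arm": assign_arm("seller-123"), "completed": True},
    {"assigned_arm": assign_arm("seller-456"), "completed": False},
]
for arm in ("old_ui", "new_ui"):
    print(arm, listing_completion_rate(listings, arm))
```

The important part is that listing_completion_rate groups sellers by the arm they were assigned to, so people who try the new UI and switch back still count against the new UI’s results.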
It sounds like you’re saying they should skip all the steps after “They implement the new version and make it available under a new URL” and jump right to “They turn down the old UI”?
That whole process seems plausibly ethical. The problem is that most companies go straight from “considering a major redesign” to “implement the new version” and then switch half of users over to the new UI and leave half on the old UI. And even with that whole process, I have literally seen dissociative episodes occur because of having a user interface changed (specifically, the Gmail interface update that happened last year). It should be done only with extreme care.
Are you talking about the Inbox deprecation?
No, the one described in https://www.theverge.com/2018/4/12/17227974/google-gmail-design-features-update-2018-redesign that came in April 2018
That’s not talking about a UI refresh, but about Gmail adding new features:
Introduction of snooze
Introduction of smart reply
Offering attachment links in the message list view
Collapsible sidebar
Is that what you’re talking about or am I still looking at the wrong thing?
That rollout of new features also included a UI refresh making it look “cleaner.”
See https://www.cultofmac.com/544433/how-to-switch-on-new-gmail-redesign/, this HN post, and https://www.theverge.com/2018/4/25/17277360/gmail-redesign-live-features-google-update which says “The new look, which exhibits a lot of softer forms and pill-shaped buttons, will have to prove itself over time”