A great example of a product actually changing for the worse is Microsoft Office. Up until 2003, Microsoft Office had the standard “File, Edit, …” menu system that was characteristic of desktop applications in the ’90s and early 2000s. For 2007, though, Microsoft radically changed the menu system: they introduced the Ribbon. I was in school at the time, and a representative from Microsoft came and gave a presentation on this bold new UI. He pointed out how, in focus-group studies, new users found it easier to discover functionality with the Ribbon than they did with the old menu system. He pointed out how the Ribbon made commonly used functions more visible, and how, over time, it would adapt to the user’s preferences, hiding functionality that was little used and surfacing functionality that the user had interacted with more often.
Thus, when Microsoft shipped Office 2007 with the Ribbon, it was a great success, and Office gained a reputation for having the gold standard in intuitive UI, right?
Wrong. What Microsoft forgot is that the average user of Office wasn’t some neophyte sitting in a carefully controlled room with a one-way mirror. The average user of Office was upgrading from Office 2003. The average user of Office had web links, books, and hand-written notes detailing how to accomplish the tasks they needed to do. By radically changing the UI like that, Microsoft made all of that tacit knowledge obsolete. Furthermore, by making the Ribbon “adaptive”, they actively prevented new tacit knowledge from being formed.
I was working helpdesk for my university around that time, and I remember just how difficult it was to walk people through tasks in Office 2007. Instead of writing down (or showing with screenshots) the specific menus they had to click through to access functionality like line or paragraph spacing, and disseminating that, I had to sit with each user, ascertain the current state of their unique special snowflake Ribbon, and then show them how to find the tools they needed for whatever it was they wanted to do. And then I had to do it all over again a few weeks later, when the Ribbon adapted to their new behavior and changed again.
This task was further complicated by the fact that Microsoft moved away from having standardized UI controls to making custom UI controls for each separate task.
For example, here is the Office 2003 menu bar:
(Source: https://upload.wikimedia.org/wikipedia/en/5/54/Office2003_screenshot.PNG)
Note how it’s two rows. The top row is text menus. The bottom row is a set of legible buttons and drop-downs which allow the user to access commonly used tasks. The important thing to note is that everything in the bottom row of buttons also exists as menu entries in the top row. If the user is ever unsure of which button to press, they can always fall back to the menus. Furthermore, documents can refer to the fixed menu structure, allowing simple text instructions to tell the user how to access obscure controls.
By comparison, this is the Ribbon:
(Source: https://kb.iu.edu/d/auqi)
Note how the Ribbon is multiple rows of differently shaped buttons and drop-downs, without clear labels. The top row is now a set of tabs, and switching tabs just brings up a different panel of equally arcane buttons. Microsoft replaced text with hieroglyphs. Hieroglyphs that don’t even have the decency to stand still over time so you can learn their meaning. It’s impossible to create text instructions to show users how to use this UI; instructions have to include screenshots. Worse, the screenshots may not match what the user sees, because items may move around or be hidden.
I suspect that many instances of UIs getting worse are due to the same sort of focus-group-induced blindness that caused Microsoft to ship the Ribbon. Companies get hung up on how new, inexperienced users interact with their software in a tightly controlled lab setting, completely isolated from outside resources, and blind themselves to the vast amount of tacit knowledge they are destroying by revamping their UI to make it more “intuitive”. I think the Ribbon is an especially good example of this, because it avoids the confounding effect of mobile devices: both Office 2003 and 2007 were strictly desktop products, so one can ignore the further corrosive effect of having to revamp a UI to be legible on a smartphone or tablet.
Websites and applications can definitely become worse after updates, but the company shipping the update will think that things are getting better, because the cost of rebuilding tacit knowledge is borne by the user, not the corporation.
This would be an example of internal vs. external validity: it may well be the case that, in their samples of newbies, when posed the specific tasks for the first time, the Ribbon worked well and the benefit was statistically established to some very high level of posterior probability; however, just because it was better in that exact setting...
This reminds me of Dan Luu’s analysis of Bruce ‘Tog’ Tognazzini’s infamous claim that Apple’s thorough user-testing experiments proved, no matter how butthurt it makes people like me, that using the mouse is much faster than using the keyboard. This is something I don’t believe: no one would argue that mousing over a GUI keyboard is faster than typing on an actual keyboard, so why would that suddenly stop being true of non-alphanumeric-entry tasks? When I consider some sequences of Emacs commands which are muscle memory, I can’t believe Tog when he says that those keypresses actually took more than 1s and that I just cognitively blink out the delay, and that’s how I delude myself into thinking that the keyboard is faster.
Of course, this isn’t quite what he says, even if he is clearly gesturing towards it in a bait-and-switch sort of way—he doesn’t describe the experiment in the first post, but in the second, he seems to say that the experiments were done back in the 1980s with first-time Mac users (indeed, probably first-time computer users, period). At which point one goes, ‘Duh! Why on earth would you expect some sort of unfamiliar function-key task to be faster than using the mouse to click on the little GUI icon?’ That would be surprising; shortcuts are always slow & awkward at first. But then the experiment offers no evidence about the more relevant use case: you are a first-time computer user only once in your life, but you may spend many decades using a particular OS or piece of software, and then you do want shortcuts.
A related issue here is the Schlitz effect (see my advertising experiments). Even if you do test, your testing is typically set up in an asymmetrical way where you are guaranteed to ratchet downward in quality.
So, if you test only on newbies, you will obviously never observe a degradation in performance for everyone else, and so you can expect to ratchet down indefinitely (some changes are strictly superior, but particularly after you pluck all the low-hanging fruit, you will tend to be on a Pareto frontier & many changes will involve tradeoffs); but if you also test on everyone else, only to check for a degradation, you will still ratchet downwards on average due to inevitable errors, where the true effect size is bad but you accept the change anyway because the point estimate wasn’t too bad, or was even good. The solution is to also experiment with improvements for everyone else, so they benefit on average, or to take a more rigorous decision-theoretic approach which sets the threshold explicitly using long-term total costs rather than some arbitrary statistical threshold (so that if you accept the risk of degradation, at least it pays off).
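To make the asymmetry concrete, here is a toy simulation (entirely my own construction, with made-up numbers): assume we are past the low-hanging fruit, so each proposed change has a slightly negative true effect on everyone-else on average, and the A/B estimate of that effect is noisy. A ‘veto-only’ policy that ships anything whose point estimate isn’t clearly bad ratchets quality downward, while a policy requiring positive evidence of benefit for everyone else does not:

```python
# Toy sketch of the testing ratchet described above; all numbers are illustrative
# assumptions, not estimates from any real A/B test.
import random

random.seed(0)

def cumulative_quality(ship, n_changes=1_000, true_mean=-0.005, true_sd=0.01, noise_sd=0.02):
    """Sum of the *true* effects of the changes a given shipping policy accepts."""
    total = 0.0
    for _ in range(n_changes):
        true_effect = random.gauss(true_mean, true_sd)        # tradeoff regime: slightly bad on average
        measured = true_effect + random.gauss(0.0, noise_sd)  # noisy experimental point estimate
        if ship(measured):
            total += true_effect
    return total

veto_only    = lambda m: m > -0.03   # ship unless the estimate shows a *clear* degradation
require_gain = lambda m: m > +0.01   # ship only with positive evidence of improvement

print("veto-only policy:   ", round(cumulative_quality(veto_only), 2))    # drifts well below zero
print("require-gain policy:", round(cumulative_quality(require_gain), 2)) # stays close to zero
```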
But even if you do test that way, you can see that for a lot of design decisions, actual A/B testing is very difficult! For example, my advertising experiments find that big banner ads have harmful effects like −10%. This is, by any reasonable standard, a pretty large effect: a banner ad seems to single-handedly drive away a tenth of readers. However, to conclude this with any confidence (and I’m still not confident in this result, because of the complexities of time-series testing and analysis), I need something like a million readers.
Now, imagine changing a logo: what could the effect of a logo redesign possibly be? I’d say that even 1% is stretching it, or a tenth the effect of having advertising. (I mean, come on, people just don’t care about logos that much and are blind to them.) But sample sizes scale badly: precision improves only with the square root of the sample size, so an effect a tenth the size needs roughly a hundred times the sample. You’d need much more than just 10 million readers to measure any logo effect, maybe >100m.
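As a rough illustration (a simplified per-reader power calculation, not the time-series analysis my ad experiments actually require, and with a hypothetical 50% baseline retention rate), the sample needed to detect a drop in some reader-retention rate scales like this:

```python
# Rough two-proportion power calculation; the 50% baseline rate, alpha, and power
# are illustrative assumptions, and real per-reader numbers depend on the metric.
from scipy.stats import norm

def readers_per_arm(p_base, rel_drop, alpha=0.05, power=0.80):
    """Readers per arm for a two-proportion z-test to detect a relative drop in p_base."""
    p_alt = p_base * (1 - rel_drop)
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p_base + p_alt) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p_base * (1 - p_base) + p_alt * (1 - p_alt)) ** 0.5) ** 2
    return numerator / (p_base - p_alt) ** 2

for drop in (0.10, 0.01):   # a banner-ad-sized -10% effect vs. a logo-sized -1% effect
    print(f"{drop:.0%} drop: ~{readers_per_arm(0.5, drop):,.0f} readers per arm")
# The -1% effect needs ~100x the sample of the -10% effect, because the standard error
# shrinks only as 1/sqrt(n); the absolute numbers here are far smaller than the millions
# quoted above because this ignores time-series correlation and realistic base rates.
```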
A logo is, in some ways, the best-case scenario: a clean, discrete change we could hypothetically A/B test per-user and track retention. And n>100m is in fact doable for a Reddit. (Google can, and has, run advertising A/B tests on billions of users; see my writeup.) But some things cannot be cleanly split by user or cohort, like the entire site’s functionality. It’s not feasible to make multiple versions of an app or ecosystem. (Me & Said Achmiz have enough trouble maintaining just the desktop/mobile × light/dark-mode versions of Gwern.net—and that’s only 4 almost-identical websites which change purely locally & client-side!) So you can’t A/B test at the level of individual users, and such designs live and die at the level of whole systems.
Let’s consider the broad ‘economic’ argument: that systems as a whole will be constrained by competition, or, perhaps we should say, evolution. Can we hope for designs to steadily evolve to perfection? Not really. Consider how much variance there is in outcomes. If Reddit changes the logo, how much does that affect the ‘fitness’ of Reddit? Hardly at all. Even all the design decisions added up scarcely matter: an exogenous event like the Digg redesign or Gamergate or Trump has far greater effects. Similarly, it often matters much more for success to have the right timing: to be the first X to have an iPhone app, or to be the first Y to bribe streamers shamelessly enough to shill yours.
If we consider Reddit and all its competitors in evolutionary terms, as a population of organisms which replicate and are being selected for better design (let’s generously assume here that their fitness is not directly opposed to better design), what sort of increase in fitness would we expect from generation to generation? Well, an evolutionary computation strategy has extremely high variance (as an on-policy model-free reinforcement learning algorithm) and is extremely sample-inefficient. To detect a subtle benefit like a +1% increase in traffic from a better logo, given the extremely high variance in Internet traffic numbers (and then whatever that translates to in fitness improvements), you would need… well, millions of Reddit competitors would be a good starting point.
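A back-of-the-envelope version of that claim (all assumptions mine: a coefficient of variation of ~3 for traffic across sites, a +1% mean difference, and an idealized two-sample comparison, which is far more efficient than blind selection actually is):

```python
# Sketch of how many 'competitor' sites a plain two-sample test would need to detect
# a +1% mean traffic advantage; the CV of ~3 is an assumed, illustrative figure.
from scipy.stats import norm

def competitors_per_group(cv, rel_diff, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z * cv / rel_diff) ** 2

print(f"~{competitors_per_group(cv=3.0, rel_diff=0.01):,.0f} competitors per group")
# ~1.4 million per group, i.e. millions of Reddit-like sites in total, before even an
# explicit statistical test could 'see' a 1%-better logo through the noise; evolution
# without such a test would need far more.
```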
Needless to say, there are not, and never will be, millions of Reddit competitors in the current Internet generation.
And indeed, when we look around at companies in the real-world economy, what we see is that incompetent inefficient companies persist all the time, indefinitely. If they are replaced, it is by another incompetent inefficient company. Should any company enjoy huge success due to intrinsic advantages, it will degrade as it grows. (Remember when being a Googler meant you were an elite hacker solving the hardest web problems, rather than a bureaucrat who worked at Zoomer IBM?) We don’t see companies converge under competition to do the same task almost equally efficiently—some companies are just way better, and others are way worse, indefinitely. This tells us that under existing conditions, evolution on system-level units like designs or corporations is almost non-existent, due to the weak connection to ‘fitness’, high variance, and poor fidelity of replication. Under such conditions, the mutation load can be enormous and complexity not maintainable, and there will definitely be no ‘progress’: as people discover all the time with evolutionary algorithms, you do still have to tune your hyperparameters or else it just won’t work in any reasonable time. (For much more on all this, and cites for things like persistence of incompetence, see my backstop essay.)
Whatever gains may be made, whatever local optima of design can be reached, they are being relentlessly eroded away by ‘mutations’ which accumulate because they are not then purged by ‘evolution’.
These mutations can happen for many reasons like errors in experiments as discussed above, or mistaken beliefs about the need to ensh—ttify an app, or just general bitrot/bitcreep, or inconsistent preferences among the managers/designers, or what have you, so taxonomizing them isn’t too useful compared to understanding why they can persist and accumulate.
Basically, the market is unable to ‘see’ or ‘enforce’ good design. If there is good design, it is created only by using more efficient methods than evolution/markets: like model-based designers or focused experiments. But such powerful non-grounded-in-the-backstop-of-evolution methods always come with many risks of their own… You’re stuck between the Scylla of absurdly inefficient but grounded methods like evolution, and the Charybdis of efficient but often misaligned methods like humans running experiments.
To add to this, I’ve never even seen a serious analysis of how changes affect intermediate or advanced users, ‘power users’, for any software program whatsoever.
Even for very popular software like Word.
Has anyone ever attempted such an analysis before?