gwern comments on Do websites and apps actually generally get worse after updates, or is it just an effect of the fear of change?

gwern 13 Dec 2023 0:42 UTC
33 points
A related issue here is the Schlitz effect (see my advertising experiments). Even if you do test, your testing is typically set up in an asymmetrical way where you are guaranteed to ratchet downward in quality.

If you test only on newbies, you will obviously never observe a degradation in performance for everyone else, and so you can expect to ratchet down indefinitely. (Some changes are strictly superior, but particularly after you pluck all the low-hanging fruit, you will tend to be on a Pareto frontier & many changes will involve tradeoffs.) But if you test on everyone else too, and only to check for a degradation, you will still ratchet downwards on average due to inevitable errors, where the true effect is bad but you accept the change anyway because the point estimate wasn’t too bad or was even good. The solution here is to also experiment with improvements for everyone else so they benefit on average, or to take a more rigorous decision-theoretic approach which sets the threshold explicitly using the long-term total costs rather than some arbitrary statistical threshold (so if you accept the risk of degradation, at least it pays off).
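The ratchet can be illustrated with a toy simulation. This is only a sketch under assumptions that are not in the comment itself: the true effect of each proposed change on existing users is drawn from a normal distribution with a negative mean (standing in for "most remaining changes are tradeoffs"), the experiment measures it with a lot of noise, and a change ships whenever its point estimate is above a lenient "not too bad" cutoff. The function name and all parameter values are hypothetical.

```python
import random

def cumulative_drift(n_changes=10_000, accept_above=-2.0,
                     true_mean=-1.0, true_sd=1.0, noise_sd=3.0, seed=0):
    """Ship every change whose noisy estimate clears a lenient cutoff.

    Returns (number of changes shipped, sum of their TRUE effects).
    All distributions and parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    shipped = 0
    total_true_effect = 0.0
    for _ in range(n_changes):
        # True effect on existing users: usually a small harm (assumed).
        true_effect = rng.gauss(true_mean, true_sd)
        # What the experiment reports: truth plus measurement noise.
        estimate = true_effect + rng.gauss(0.0, noise_sd)
        # "Only check for a degradation": ship unless it looks clearly bad.
        if estimate > accept_above:
            shipped += 1
            total_true_effect += true_effect
    return shipped, total_true_effect

shipped, total = cumulative_drift()
print(f"shipped {shipped} changes, cumulative true effect {total:.1f}")
```

Under these assumptions most proposals ship and the cumulative true effect is clearly negative, even though every shipped change passed its test. Raising `accept_above` toward a threshold derived from the expected long-term cost is the decision-theoretic fix the paragraph describes.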

But if you do that, you can see that for a lot of design approaches, even when doing actual A/B testing, it is very difficult! For example, my advertising experiments find that big banner ads have harmful effects like −10%. This is, by any reasonable standard, a pretty large effect: a banner ad seems to single-handedly drive away a tenth of readers. However, to conclude this with any confidence (and I’m still not confident in this result because of the complexities of time-series testing and analysis), I need something like a million readers.

Now, imagine changing a logo: what could the effect of a logo redesign possibly be? I’d say that even 1% is stretching it, or a tenth the effect of having advertising. (I mean come on, people just don’t care about logos that much and are blind to them.) But sample sizes scale badly, so you’d need much more than just 10 million readers to measure any logo effect. (Maybe >100m, because precision improves only with the square root of the sample size, so a tenth the effect needs roughly a hundred times the readers?)
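The scaling can be made concrete with a back-of-the-envelope calculation. This sketch assumes the noise level, power, and significance threshold stay the same between the two experiments, in which case the required sample size grows with the inverse square of the effect size. The function name is hypothetical, and the inputs are just the round numbers from the text (about a million readers for a 10% effect).

```python
def scaled_sample_size(known_n, known_effect, new_effect):
    """Sample size needed to detect new_effect, given that known_n was
    enough for known_effect. Assumes identical noise, power, and
    significance level, so n scales as 1 / effect**2."""
    return known_n * (known_effect / new_effect) ** 2

banner_n = 1_000_000  # readers needed for the ~10% banner-ad effect (from the text)
logo_n = scaled_sample_size(banner_n, known_effect=0.10, new_effect=0.01)
print(f"{logo_n:,.0f} readers for a 1% logo effect")
```

A tenfold smaller effect comes out at about a hundred times the readers, i.e. on the order of 100 million, which matches the rough guess above.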

A logo is, in some ways, the best-case scenario: a clean discrete change we could hypothetically A/B test per-user and track retention. And n>100m is in fact doable for a Reddit. (Google can run, and has run, advertising A/B tests on billions of users, see my writeup.) But some things cannot be cleanly split by user or cohort, like the entire site functionality. It’s not feasible to make multiple versions of an app or ecosystem. (Said Achmiz & I have enough trouble simply maintaining desktop/mobile × light/dark-mode versions of Gwern.net, and that’s only 4 almost-identical websites which change purely locally & client-side!) So you can’t A/B test such changes at the level of individual users; they live and die at the level of the whole system.

Let’s consider the broad ‘economic’ argument: that systems as a whole will be constrained by competition, or, perhaps we should say, evolution. Can we hope for designs to steadily evolve to perfection? Not really. Consider how incredibly much variance there is in outcomes. If Reddit changes the logo, how much does that affect the ‘fitness’ of Reddit? Hardly at all. Even all the design decisions added up scarcely matter: an exogenous event like the Digg redesign or Gamergate or Trump has far greater effects. Similarly, it often matters much more for success to have the right timing: to be the first X to have an iPhone app, or to be the first Y to bribe streamers shamelessly enough to shill yours.

If we consider Reddit and all its competitors in evolutionary terms as a population of organisms which replicate and are being selected for better design (let’s generously assume here that their fitness is not directly opposed to better design), what sort of increase in fitness would we expect from generation to generation? Well, an evolutionary computation strategy has extremely high variance (as an on-policy model-free reinforcement learning algorithm) and is extremely sample-inefficient. To detect a subtle benefit like a +1% increase in traffic from a better logo given the extremely high variance in Internet traffic numbers and then whatever that translates to in fitness improvements, you would need… well, millions of Reddit competitors would be a good starting point.

Needless to say, there are not, and never will be, millions of Reddit competitors in the current Internet generation.
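To see how weakly selection 'sees' a small design advantage, here is a toy truncation-selection sketch. Everything in it is an assumption made for illustration: half of the competing sites get a +1% design bonus, every site's outcome is dominated by noise with a standard deviation 100 times that bonus (standing in for timing, exogenous events, and so on), and the top half by outcome 'survive'. The function name and parameters are hypothetical.

```python
import random

def better_design_share_among_survivors(n_sites=10_000, advantage=0.01,
                                        noise_sd=1.0, seed=0):
    """Fraction of surviving sites that had the better design.

    Half the sites get a small design advantage; outcomes are mostly
    noise; the top half by outcome survive. Values are illustrative.
    """
    rng = random.Random(seed)
    sites = []
    for i in range(n_sites):
        has_better_design = (i % 2 == 0)  # exactly half get the bonus
        outcome = (advantage if has_better_design else 0.0) + rng.gauss(0.0, noise_sd)
        sites.append((outcome, has_better_design))
    # Truncation selection: keep the top half by realized outcome.
    sites.sort(key=lambda site: site[0], reverse=True)
    survivors = sites[: n_sites // 2]
    return sum(1 for _, better in survivors if better) / len(survivors)

share = better_design_share_among_survivors()
print(f"share of survivors with the better design: {share:.3f}")
```

Even with ten thousand competitors, the surviving population stays close to half-and-half under these assumptions, so a generation of selection barely shifts the mix toward the better design. With the handful of competitors that actually exist, the signal is effectively invisible.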

And indeed, when we look around at companies in the real-world economy, what we see is that incompetent, inefficient companies persist all the time, indefinitely. If they are replaced, it is by another incompetent, inefficient company. Should any company enjoy huge success due to intrinsic advantages, it will degrade as it grows. (Remember when being a Googler meant you were an elite hacker solving the hardest web problems, rather than a bureaucrat who worked at Zoomer IBM?) We don’t see companies converge under competition to do the same task almost equally efficiently; some companies are just way better, and others are way worse, indefinitely. This tells us that under existing conditions, evolution on system-level units like designs or corporations is almost non-existent due to the weak connection to ‘fitness’ / high variance, and poor fidelity of replication. Under such conditions, the mutation load can be enormous and complexity unmaintainable, and there will definitely be no ‘progress’: as people discover all the time with evolutionary algorithms, you do still have to tune your hyperparameters or else it just won’t work in any reasonable time. (For much more on all this, and cites for things like persistence of incompetence, see my backstop essay.)

Whatever gains may be made, whatever local optima of design can be reached, they are being relentlessly eroded away by ‘mutations’ which accumulate because they are not then purged by ‘evolution’.

These mutations can happen for many reasons like errors in experiments as discussed above, or mistaken beliefs about the need to ensh—ttify an app, or just general bitrot/bitcreep, or inconsistent preferences among the managers/designers, or what have you, so taxonomizing them isn’t too useful compared to understanding why they can persist and accumulate.

Basically, the market is unable to ‘see’ or ‘enforce’ good design. If there is good design, it is created only by using more efficient methods than evolution/markets: like model-based designers or focused experiments. But such powerful non-grounded-in-the-backstop-of-evolution methods always come with many risks of their own… You’re stuck between the Scylla of absurdly inefficient grounded methods like evolution and the Charybdis of efficient but often misaligned methods like humans running experiments.