What I got out of ‘Algorithms to Live By’

Sorry for the length. In a few paragraphs there’s a reader’s guide so you can skip around. I have tried to put the most valuable stuff up first. Consider only reading the introduction.

At every moment, there are choices you could make. Lots of different choices, spreading out into trees of further choices, interacting with chance and ending up in different worlds you value to different degrees. As you think about which path to take, you learn more about what is likely on each branch. But as you gain knowledge, you lose opportunities: branches get left behind as you follow the track of waiting and thinking.

How are we supposed to figure out how to explore this space effectively? We need approaches that trade off integrating knowledge of the tree, preserving future options, and the cost of spending time thinking.

These are hard questions, and we don’t have complete answers, but we might look to those who have studied similar problems. One field of study that has overlap with this conundrum is that of algorithms. We can look at algorithms as case studies in rationality.

Making decisions is hard, and computer science is partly the study of finding the best decision given time and space constraints—and humans certainly face those constraints. We also face forgetfulness, impulsivity, weak mental maths … But we at least face time and space constraints. Algorithms to Live By by Brian Christian & Tom Griffiths is an exploration of the applicability of algorithms from computer science to human decision problems.

Though the book is flawed, I have changed my behaviour in some ways because of it, and am considering changing others as well.

Here are the three changes I’ve made that have been most worthwhile so far:

  • When I first get a set of new options that is likely to stay stable into the future, I prioritise choosing a new option over repeating a good choice (from Explore /​ Exploit). Odds of around 2:1 /​ 66% confident that this is an improvement.

  • I schedule internet access (unless I need to be responsive), and have my phone on aeroplane mode most of the time (from Scheduling). Odds above 9:1 /​ 90% confidence that this has been an improvement, but I have doubts about its long term feasibility.

  • For many things (email, paper & computer files) I no longer worry about having a good organisational system. I am prepared to pay the search cost when I need something rather than trying to pre-empt it by keeping things in their place. I think this is an improvement but I’m not that confident (maybe around 3:2 that it’s an improvement).

Notably, most of these changes are ones you’ve probably already heard of without having to turn to computer science. This is probably a good thing: I would consider it evidence against the book if it claimed to have lots of high-value, very novel advice. Things like personal productivity have a fair amount of value on the line, and a lot of ink has been spilled and money spent on them. If big wins were available to the first person to look at computer science, those wins would probably have been found and become known by now¹.

I mentioned the book has flaws. I’ll get to them in a second. It also has strengths.

  • Having an explicit model can be helpful. Yes, a lot of the book’s advice is already encoded in my intuitions or in folk advice. But I’ve found it valuable that it has made some of my intuitions about work concrete and explicit (for example, that spending a lot of time organising reference materials seems to always be busywork that doesn’t pay for itself).

  • It accepts and explores differences between our decision contexts and those contexts for which the algorithms were developed. It’s not just a naïve “act like a computer!” book

  • It makes sure one understands when a problem is algorithmically intractable. By default, I am now more likely to treat complex, suboptimal situations as opportunities to optimise in the sense of ‘improve’ rather than in the sense of ‘perfect’.

On the other hand, there were definitely some problems. Some of the illustrative decisions felt particularly weak. For example, the book opens with a discussion of so-called ‘optimal stopping’ problems. I won’t cover the details here, but these problems concern being shown a series of options in order. You can pass on an option (never to return to it) or accept it and see no more options. How, asks the optimal stopping problem, can we maximise our probability of picking our most-preferred option?

The illustration the book provides for this is the problem of searching for somewhere to live. Regardless of how well this is modelled as a series of pass-accept options, it is certainly not well modelled as trying to maximise our probability of getting the best option. Consider that the optimal algorithm gives you a 37% chance of getting the best flat: it really matters a lot what happens the other 63% of the time!
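To make that concrete, here’s a quick simulation (my own sketch, not from the book) of what the 37% rule delivers beyond its headline success rate:

    import numpy as np

    # Simulate the classic rule: skip the first n/e candidates, then take
    # the first candidate better than everything seen so far (and get
    # stuck with the last candidate if no such candidate appears).
    rng = np.random.default_rng(0)
    n, trials = 100, 50_000
    cutoff = int(n / np.e)

    ranks = []
    for _ in range(trials):
        quality = rng.permutation(n)          # 0 = worst, n - 1 = best
        best_seen = quality[:cutoff].max()
        pick = n - 1                          # default: stuck with the last option
        for i in range(cutoff, n):
            if quality[i] > best_seen:
                pick = i
                break
        ranks.append(n - quality[pick])       # rank 1 = best possible

    ranks = np.asarray(ranks)
    print("P(got the very best):", (ranks == 1).mean())    # ~0.37
    print("P(bottom half):      ", (ranks > n // 2).mean())
    print("median rank obtained:", np.median(ranks))

When the rule fails, it can fail hard: if the best option was in the skipped portion, you end up with whatever happens to come last.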

Even in quite transferable cases, like sorting, it pays to remember a piece of old programming wisdom (Rob Pike’s Rule 3):

Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don’t get fancy. [...]

When we study complexity, we study behaviour as the number of items an algorithm is processing gets large. Like really large. Tend-to-infinity large. It was a shame the book didn’t probe this at all.

Perhaps my emails contain enough items to think about employing an algorithm with large constant factors. Organising a class’ worth of marking probably doesn’t.

Reader’s Guide

If you want to know which sections to read (either in this post or in the book itself) based on the behaviour changes I decided to take away (or already use and endorse), here they are in rough order of expected usefulness (the sections of this post are in the same order):

  1. Bayes Rule, for rules of thumb based on prior distributions

  2. Explore /​ Exploit, for taking new options vs the best known

  3. Scheduling for interrupt coalescing

  4. Networking for buffer bloat

  5. Computational Kindness for expressing preferences

  6. Networking (again) for exponential backoff

  7. Sorting & searching for a clearer sense of when sorting will be useful

  8. Caching for LRU & thinking about where to put your caches

  9. Overfitting for early-stopping and some thought-food for self-training and TAPs

If you want to practice ideas to do with coordination & equilibria: read the Game Theory section (especially its associated note)

If you want to know new concepts I got most excited about: read Constraint relaxation & randomness

Bayes Rule

I have put this section first because it provides some heuristics for making estimates based on a single observation and various typical priors. This could help a lot with explicit estimates and predictions. My biggest concern about the value of this section is that I’ve not yet had cause to use these heuristics. However, as they are the only part of the book I imagine will be broadly novel and broadly valuable, I’ve included them first.

Rules of Thumb

One awesome thing from this chapter was its rules of thumb for certain estimates. They mainly work for durations, where you have no information about when during the duration you’ve turned up and you want to estimate how long the total duration will be. They also work if you observe a number from a sequence—like serial numbers of taxis or, famously, position in the birth order of all people who will ever be born. You could then estimate the total number of taxis—or the expected lifetime of humanity!

There’s a rule of thumb for each of three distributions: the power law, the Erlang and the normal. I also consider the lognormal case, but it doesn’t add much to the previous ones.

Power law

In a power law distribution, there’s a fair amount of probability mass far out in the tail. That is, there are values at very different scales from the median. Because values across such a range of scales are possible, you should multiply your observed result by some constant. What constant, you ask? For a power law distributed like t⁻ⁿ, where t is the random variable, you should multiply by the n-th root of 2.

So, as the book says, if you think the amount of money Hollywood films make is distributed like total_money⁻¹, and you hear a film has made $1000 so far, your guess for its total should be $2000. But note, that’s the median of your posterior: the guess that’s as likely to be too high as too low. Your expectation of the total would be higher still, because a lot of weight sits on a small chance of hundreds of millions of dollars.
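Here’s a Monte Carlo sanity check of that rule (my own sketch; I use n = 2 because a t⁻¹ prior isn’t normalisable without a cutoff, and I assume the observation falls uniformly within the total):

    import numpy as np

    # Prior p(t_total) ∝ t_total^(-n) on [t_min, ∞); we observe a point
    # drawn uniformly from [0, t_total] and want the posterior median.
    rng = np.random.default_rng(0)
    n_exp, t_min = 2.0, 1.0

    # Inverse-CDF sampling from the power-law prior.
    u = rng.uniform(size=1_000_000)
    totals = t_min * (1 - u) ** (-1 / (n_exp - 1))
    observed = totals * rng.uniform(size=totals.size)

    t = 10.0
    near = np.abs(observed - t) < 0.1      # condition on observing ~t
    print("empirical posterior median:", np.median(totals[near]))
    print("rule of thumb, t * 2^(1/n):", t * 2 ** (1 / n_exp))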

Erlang

I hadn’t encountered the Erlang distribution before. If you haven’t, first think of the exponential distribution. The exponential distribution covers the time between two occurrences of something that happens continually with the same average rate. The Erlang distribution generalises this to the time it takes for n such occurrences. When n is 1, the Erlang distribution collapses to the exponential.

Anyway, it turns out that you should expect a constant additional amount of time to pass before completion, no matter when you observe the process. This seems plausible, since the Erlang is a sum of independent, memoryless waits (though note that the sum itself is only memoryless when n is 1). According to the original paper, you should expect it to take βln(2) longer, where β is the average time between events. But that seems to me to be correct for the specific case where the Erlang distribution reduces to an exponential distribution, and not the general case. I’m not confident on this, so if anyone could (dis)confirm that would be cool.
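In the meantime, here’s a simulation anyone can use to check (my own setup: totals drawn from an Erlang prior, observation times uniform within each total; an exactly additive rule would keep the printed extra time constant):

    import numpy as np

    rng = np.random.default_rng(0)
    k, beta = 3, 1.0                               # shape and scale of the prior
    totals = rng.gamma(k, beta, size=5_000_000)    # Erlang = integer-shape gamma
    observed = totals * rng.uniform(size=totals.size)

    for t in [0.5, 1.0, 2.0, 4.0, 8.0]:
        near = np.abs(observed - t) < 0.05
        median = np.median(totals[near])
        print(f"observed t = {t:3.1f} -> predict {median - t:.2f} more time")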

Normal

There isn’t really a simple rule for a normal distribution in the same way as for the others. Rather, for values much less than the mean, it’s safe to predict the mean (or just over it). Once you’re over the mean, expect not to go that much further over it. That is, add a fairly small amount relative to the scale of the distribution. So if car lifetimes are normally distributed for a given model, and your friend is driving a car that’s slightly older than average for that model, expect that it only has a few more years left in it.

While this isn’t the most satisfying rule, I could see it providing some use in Fermi-style estimates and hopefully my intuitions about it will sharpen.

Lognormal

Sometimes I feel like my models of lognormals and power laws are interchangeable. Above the lognormal’s median, they look kind of similar (but I prefer the lognormal because of its reasonable behaviour around 0). And indeed, we find that above the median of the lognormal distribution, an appropriate rule is also a multiplicative one.

To see this, remember that the logarithm of a lognormally distributed variable is distributed normally (hence the name). So we can apply the rule for the normal distribution: if the logarithm of your observation is significantly less than the logarithm of the distribution’s median (so let’s say the observation is about half the median) just go with the median. If the logarithm of the observation is significantly greater than the median, we expect the logarithm of the final elapsed duration (or what-have-you) to be a bit bigger than the current logarithm.

If we model that as a constant addition to the logarithm (as in log(expected) = log(observed) + log(k) = log(k * observed)), then we recover a multiplicative heuristic!
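A simulation (same assumptions as before; my own sketch) suggests how good an approximation this is. If the rule were exactly multiplicative, the printed ratio would be constant:

    import numpy as np

    rng = np.random.default_rng(0)
    totals = rng.lognormal(mean=0.0, sigma=1.0, size=5_000_000)  # median = 1
    observed = totals * rng.uniform(size=totals.size)

    for t in [1.0, 2.0, 4.0, 8.0]:            # all at or above the prior median
        near = np.abs(observed - t) < 0.02 * t
        print(f"t = {t:3.0f}: posterior median / t = {np.median(totals[near]) / t:.2f}")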

These rules have the potential to be useful, but they don’t give you much guidance for shifting the distribution that you think underlies the phenomenon. They won’t help you update your belief about the mean of a normal distribution, nor that it looks more like an Erlang distribution than a power distribution. So they are best used when you have a lot of data to characterise the distribution, and little information about the object of observation.

Explore /​ Exploit

There is a tension between getting value from the best known option (‘exploiting’) and checking to see if there are still better options (‘exploring’). This chapter discussed some algorithmic approaches to that problem. I found especially useful the (in retrospect, obvious) point that exploring is more worthwhile the longer you have to enjoy a payoff. If you don’t have long, stick to exploiting; if you have years, shop around. As a result of this, I have increased my tendency to explore in some situations. In general, however, it seems I should be increasing my tendency to exploit. I’ve not taken action on this yet.
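The horizon point can be made vivid with a toy bandit (my own sketch, not the book’s): two arms with unknown payout rates, an ‘explore first, then commit’ strategy, and a sweep over how much to explore.

    import numpy as np

    rng = np.random.default_rng(0)

    def average_reward(horizon, explore, trials=20_000):
        """Mean total reward: pull each arm `explore` times, then commit."""
        p = rng.uniform(size=(trials, 2))                   # hidden arm means
        pulls = rng.random((trials, 2, explore)) < p[..., None]
        best = pulls.mean(axis=2).argmax(axis=1)            # arm we commit to
        explore_reward = pulls.sum(axis=(1, 2))
        exploit_reward = (horizon - 2 * explore) * p[np.arange(trials), best]
        return (explore_reward + exploit_reward).mean()

    for horizon in [20, 200, 2000]:
        budgets = [m for m in (1, 3, 10, 30) if 2 * m < horizon]
        best_m = max(budgets, key=lambda m: average_reward(horizon, m))
        print(f"horizon {horizon:4d}: best exploration budget per arm = {best_m}")

Bigger horizons reward bigger exploration budgets; with only twenty pulls to enjoy, heavy sampling is a waste.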

The chapter provides some evidence that humans tend to over-explore. While it seems safe to assume this is true for me as well, I think I have identified cases where I under-explore: particularly when a new suite of options appears (and an old one disappears). For example, moving to a new city (not trying enough different places) or attending a conference (stopping networking after meeting a few interesting people).

The literature on over-exploration is the strongest reason to think I might be wrong here, but there’s also a threat from something like social desirability bias. That is, exploration is considered laudable and cool, so I will be drawn to increase my exploration to also increase my status. (I reduced my estimate of the probability that this behaviour change was net good after writing this paragraph.)

Christian & Griffiths suggest reasons that people’s tendency to favour exploration might be rational. They spend most time on the idea that increased exploration (particularly returning periodically to explore more) makes sense in a world without fixed payoffs. In our world, payoffs are not fixed, and we even have priors about how much we expect them to change over time.

I would also add that many of the studies that found over-exploration (e.g. Tversky & Warren, 1966) studied situations that particularly favoured exploitation (in the case above, something analogous to betting on a stably biased coin). I fully accept the evidence that in extreme cases, humans tend to exploit insufficiently. I am more sceptical that this generalises.

There’s also the issue of combinatorial search when options may be complements or substitutes for each other. (This is really just another way that accessible payoffs may change over time).

For this issue, think again of moving to a new city or starting a new job. Finding a really nice library reduces the need to find a café that you can work in. A competitive tennis club you love that’s only open once a week increases the value of other places to practice. You understand the company better if you have worked with multiple teams.

Personally, I think I am prone to complacency in such scenarios. It may just be that I’m at the tail of the distribution on explore vs exploit. I’d be interested to see a study on people’s self-perceptions as explorers vs exploiters and how that correlates with reality.

At first it seems like learning a new skill would be another example: you can learn different subskills or use different instructors. But actually it seems like a counterexample. I’ve failed to learn many skills because of all the time I spent picking between learning resources and subskills when I should have just been practicing.

Scheduling

I derived most of my value in this section from further internalising the productivity risks of interruptions. The advice isn’t novel for most people, but putting it into practice seems to remain difficult, so I rate the internalisation highly. If you are not familiar with the problems of interruptions, Cal Newport has written a lot in this area. Here’s a blog post of his that came up when googling “Cal Newport interruptions”.

It’s possible that removing interruptions just isn’t feasible long term, in which case I shouldn’t have placed this section so high. A problem with this section (of this post, as opposed to of the book) is that I don’t feel like I had many small insights I can summarise.

I did get one particular, communicable, useful idea from this chapter: interrupting someone more than a few times an hour can eat almost all the available work time in that hour. The authors draw this idea from a study suggesting that it might take minutes for a human to recover productivity after a context switch.

I couldn’t find the study in the notes to the book, and a single study isn’t strong evidence anyway. Nonetheless, the figure seems reasonable enough that I feel comfortable using it as a motivational bump. (If the figure isn’t reasonable, should we even be worried about interruptions?) But I hadn’t drawn out the specific implication from a low number of interruptions to vanishing hours.
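The arithmetic is simple enough to state in a few lines (the numbers are my own illustrative assumptions, not the study’s):

    # If refocusing takes ~15 minutes and interruptions land while you're
    # still recovering, a handful per hour leaves almost nothing:
    recovery_minutes = 15
    for interruptions in [1, 2, 3, 4]:
        lost = min(60, interruptions * recovery_minutes)
        print(f"{interruptions} interruptions/hour -> {60 - lost} focused minutes")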

Should we be worried about the lack of concrete advice?

Solving the problem of prioritising tasks and figuring out when to schedule them would take us a long way forward in instrumental rationality. As I mentioned in the introduction, we should probably be relieved, and trust the book a little more, because it doesn’t claim to have solved this: personal scheduling really matters! It has big economic benefits for individuals and organisations. If this book offered large undiscovered gains in this area, I’d be pretty uneasy.

Networking

Because of a discussion of an idea called ‘buffer bloat’, I became keener to reduce the number of items on my todo & reading lists. I will consider implementing a digital ‘reference folder’ where I can put things that seem like they might be useful, rather than defaulting to putting them onto a ‘to read’ list.

I claim below that the analogy to humans seems pretty weak for buffer bloat. But I still think that allowing long lists in my life is a problem. It seems plausible that I’m just connecting good-seeming advice I’ve heard before to less well-supported advice. I’m not certain whether that should seriously reduce my confidence though (on that hypothesis, I’m still relying on advice I evaluated well before).

The next most important idea I got was that of exponential backoff. Whether it’s revisiting a course of action that seems worthwhile but more-and-more likely to fail, checking to see if a software build is done, or attempting to schedule dinner with a friend that’s always busy, simply doubling the interval between attempts seems a reasonable first stab at keeping information-gathering costs down without giving up on promising avenues.
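As a pattern it is only a few lines (a generic sketch of my own; real network stacks usually also add random jitter so retries don’t synchronise):

    import time

    def with_backoff(try_once, base_wait=1.0, max_wait=64.0, max_tries=8):
        """Call `try_once` until it succeeds, doubling the wait after each failure."""
        wait = base_wait
        for _ in range(max_tries):
            if try_once():
                return True
            time.sleep(wait)
            wait = min(wait * 2, max_wait)   # double, but cap the interval
        return False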

To give more detail: buffer bloat, as I understand it from this chapter, could not affect a human in an analogous way. The problem is information getting stuck at the back of a long queue, with the sender none the wiser. This makes the time until that information is processed unacceptably long. In networks, this can lead to the receiver thinking the sender takes a long time to receive and process responses, so the receiver moderates its responses more than necessary.

If we’re thinking of a reading or a todo list, a human would rarely work through it in order, but would keep an eye out for high priority items (a counter-example for me is RSS: I often do churn through my feeds in order).

Nonetheless, I found it a useful lens to think with. Particularly the following quote:

One of the fundamental principles of buffers [...] is that they only work correctly when they are routinely zeroed out.

I’ve gone a long time without zeroing-out my reading list. Search costs (covered later) for valuable reading are definitely getting high. There is also a mental toll from awareness of its infinitude.

Computational Kindness

Computational kindness is not an algorithm, but the conclusion the authors draw. The idea is to bear in mind the implicit computational work our actions place on others. By being concrete and proposing specific actions or times, we can allow someone to merely check rather than search. If we have the information to hand, if the interlocutor is doing us a favour, or if they have higher relative demands on their computational resources, this is a good thing to consider.

I already offer preferences even when they’re weak and suggest times and dates for meetings, roughly for computational considerations. I have not yet thought of further ways to take this advice into account. Suggestions are welcome. The rest of this section is a concrete expansion of the reasoning behind computational kindness.

Imagine you and a friend are big film buffs, and want to go to the cinema together.

You could suggest a time, or just ask “when’s good for you?” If you suggest a time, and it doesn’t suit the person, they might feel awkward asking to meet at a different time. So maybe it’s best to let them offer.

But! That person then needs to search through their schedule for a good time, which can take quite a bit of work.

So maybe it’s kinder to suggest the time. Or, you could suggest a time and also communicate some constraints if the time doesn’t work. As in, “How about Tuesday? Or if you can’t do that, I’m flexible. I just can’t do the weekend and the week after next is less good.”

Similarly, when it comes time for you and your friend to pick a film, vetoing your least favourites could make it easier to home in on an acceptable choice. Or, when writing a lengthy post, you could indicate at the beginning where to find things that might be of interest to the reader (obviously I’ve not done this perfectly, but I hope the intention is well-taken).

Sorting & searching

One thing I got from these chapters was thinking about why we sort. We normally sort stuff so that we can find things in it later! If your pile of papers is well-sorted, you can do a binary search on it and find what you need quickly. If you have to search through something unsorted, you might have to go through every item. But if you expect to search through a large number of items only a small number of times, you may not be much better off having sorted than just searching through it.
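Counting comparisons makes the trade-off concrete (illustrative arithmetic only, ignoring constant factors):

    import math

    def breakeven_lookups(n):
        """Lookups needed before sort-then-binary-search beats linear scans."""
        sort_cost = n * math.log2(n)     # comparison sort
        linear = n / 2                   # expected cost of one unsorted scan
        binary = math.log2(n)            # cost of one lookup once sorted
        return sort_cost / (linear - binary)

    for n in [100, 10_000, 1_000_000]:
        print(f"n = {n:9,d}: sorting pays for itself after ~{breakeven_lookups(n):.0f} lookups")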

Also, the fact that the time it takes to sort items by comparison grows superlinearly with the number of items is an important insight! If it had been new to me, it would have been quite valuable (though probably still not enough to move this section up).

There are two problems with leaving more things unsorted:

  1. You might not have a good intuition about which things you look through often and which rarely. So you might be leaving a lot of efficiency uncaptured.

  2. When you pay the time costs matters. I imagine I’m not alone in the face-reddening experience of scrabbling through pages of notebooks and folders full of loose-leaf documents in meetings while everyone looks on. In some situations, spending more time in total on sorting and searching is a good choice: there might not be much better to do while sorting, and there might be large advantages to being able to find the relevant material quickly.

Sorting & searching are perhaps the most archetypical algorithmic activities, and these chapters did a fairly good job of expressing how much approaches could differ in efficiency. The classic comparison between bubble sort and merge sort really pumps up your intuition that there could be hacks to be found! Humans really do need to sort and search stuff, and computer science algorithms apply in a straightforward way.

For completeness, I will give some concrete sorting algorithm suggestions. But first, if you really have a lot of stuff to sort, remember to check the value of your time. It may well be better to pay someone to do this for you.

That said, if you need to sort a lot of material that you can only compare pairwise (rather than, say, score), look to a merge sort. If items have a sortable key (a score, or alphabetical order), it may be better to use a radix sort.
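For the second case, the cheap core move is bucketing rather than comparing; a single pass of it looks like this (my own sketch):

    from collections import defaultdict

    def bucket_by_first_letter(titles):
        """One radix-style pass: bin items by key instead of comparing pairs."""
        buckets = defaultdict(list)
        for title in titles:
            buckets[title[0].upper()].append(title)
        return [t for letter in sorted(buckets) for t in buckets[letter]]

A full radix sort just repeats a pass like this for each character position.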

I would love to know about efficiently but roughly sorting material. Especially when my comparisons are noisy or error-prone! The book didn’t discuss this, though Gwern has produced some practical prior art.

Caching

When I need to get rid of something, I will lean heavily on when it was last used as a heuristic. This comes from the chapter’s claim that a cache-eviction algorithm called LRU (Least Recently Used) performs well in a wide variety of environments. Obviously I’ll need to watch out for cycles (like clothes that get worn only in one season).

Little evidence is provided (an article by a Dropbox intern and an early paper on caching) beyond the claim that LRU is great, so I would have to do research to back this up. The authors write

LRU [...] is the overwhelming favorite of computer scientists

which, if true, seems reasonable evidence of its suitability for adoption.
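For reference, the policy itself is tiny (a standard implementation sketch, not code from the book; Python even ships one as functools.lru_cache):

    from collections import OrderedDict

    class LRUCache:
        """Evict whatever was used longest ago when space runs out."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = OrderedDict()

        def get(self, key):
            if key not in self.items:
                return None
            self.items.move_to_end(key)          # mark as most recently used
            return self.items[key]

        def put(self, key, value):
            self.items[key] = value
            self.items.move_to_end(key)
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)   # drop the least recently used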

I will also consider placing items so that they’re close to where they’re needed. I’ll copy two items from the book here:

  1. Keeping gym items in a crate by the front door. If you only wear these clothes at the gym, you only need them while you’re out, so it makes sense to keep them on your outward route.

  2. Keeping hoover bags behind the sofa. When your hoover gets full, it’s probably because you’re doing some hoovering! If you’re lucky, that will tend to happen in the same place as well.

A possible way of using this idea is to look for habit triggers in your life. For example, if you always remember to do your laundry when you have a shower, maybe move your laundry hamper to your bathroom.

Overfitting

Overfitting is what happens when your model is too sensitive to idiosyncratic details of your training data. It could be seen as failing to prioritise simplicity in your models over ad-hoc additions to capture exceptions. Discussion in this chapter has pushed me closer towards regularly timeboxing. (I have used the pomodoro method for the first time in years now that I’ve read the book). I normally think of timeboxing as a method for breaking down a task and reducing the delay until payoff. This chapter discussed its role in keeping work limited when marginal payoff becomes uncertain.

I doubt this works if you are trying to produce at the top of your field. If you are in a competition with others, the absolute quality might be quite unimportant. It just has to be unambiguously better than your competitors, even if all of you could have done work that was 90% as good in 10% of the time.

With overfitting, you end up predicting that new data will err from the ‘true average’ at each point in the same way your sampled data did. It’s an annoying problem in machine learning: a naïve machine-learning algorithm doesn’t have a prior against complex models.

As humans, as well, we can be prone to adding an extra detail to our model: a complication we think we should probably account for. But piecemeal accounting for complications is dangerous. It makes the model harder to work with, and improvements to its output are dubious at best. Sticking with simplicity is frequently our best option. One idea the authors cover seemed particularly useful to me: early stopping. I hadn’t thought of this as a way to generate simplicity before.
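In its machine-learning home, early stopping looks something like this (a toy of my own devising: gradient descent on an over-flexible polynomial fit, halted when held-out error stops improving):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=40)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.size)   # noisy target
    X = np.vander(x, 10)                                      # degree-9 features
    X_train, y_train, X_val, y_val = X[:30], y[:30], X[30:], y[30:]

    w = np.zeros(X.shape[1])
    best_val, patience = np.inf, 0
    for step in range(20_000):
        w -= 0.05 * X_train.T @ (X_train @ w - y_train) / len(y_train)
        val = np.mean((X_val @ w - y_val) ** 2)
        if val < best_val:
            best_val, patience = val, 0
        else:
            patience += 1
            if patience > 500:        # validation error has stopped improving
                break                 # stop early; don't fit the noise
    print(f"stopped at step {step} with validation MSE {best_val:.3f}")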

To be concrete, one way to control how complex your models or plans are is to restrict the amount of time you allow yourself to generate them. For this to work, you have to actually explore simpler options first, which one might not lean towards instinctively. But if you force yourself to actually come up with a model/​solution in the time allotted, you are very likely to lean on simplicity. I claimed above that complexity is hard to work with. If that holds, and if you are limiting the amount of time to get a workable model, you should be able to constrain yourself to simpler models.

Game Theory

Game theory is worth knowing about, and this chapter was a pretty good gentle introduction to game theory and the ideas of equilibria. In general, though, I think better introductions are available in the LW-o-sphere, for example this recently curated post.

There was also some discussion of inadequate equilibria. Again, I think reading Yudkowsky’s book would be better here; I don’t think the treatment will lead you to robust intuitions. For example, the authors discuss the game-theoretic problems with unlimited vacation: assuming you get some important benefit by taking little holiday relative to everyone else, and pay some cost by taking the most holiday, the equilibrium is no days of holiday at all (assuming those costs and benefits are large compared to the value of a holiday). They then go on to discuss a software CEO who was paying employees to take vacation:

[F]rom a game theoretic approach it’s actually misguided. Increasing the cash on the table in the prisoner’s dilemma, for instance misses the point: the change doesn’t do anything to alter the bad equilibrium. [...] The problem isn’t that vacations aren’t attractive; the problem is that everyone wants to take slightly less vacation than their peers, producing a game whose only equilibrium is no vacation at all. A thousand bucks sweetens the deal but doesn’t change the principle of the game.

Now, I think what the authors are suggesting here is that $1000 is not much compared to the benefits and costs of taking the least / most vacation. But it could sound like the payment is as futile as increasing the money on the table in a prisoner’s dilemma, and it’s definitely not! Crucially, you get the money if you ‘cooperate’ (take holiday) even if others ‘defect’ (take no holiday). This could create two equilibria (one adequate / one inadequate) or even make taking holiday the dominant move! If you’d like more detail on that, see the game theoretic note at the end.

Relaxation & Randomness

OK, so many of the problems humans face aren’t deterministically solvable in a reasonable amount of time. That’s a shame, but it’s not really a surprise. What about if we ask for an OK solution? Or one that’s probably good? Or one with a high expected value? Or a solution that works as long as we can change some features of the decision problem, so we can look at that next?

The fields of algorithmic relaxation & randomness explore answers to the above questions. This was really exciting! Humans are pretty bad at reasoning with probability, few constraints are really fixed in stone, and we often have more options than can be considered. It seems like we could have uncovered an avenue for novel, valuable advice.

Unfortunately, these chapters were pretty slim on applicable algorithms. I guess that makes sense. As indicated above, we aren’t that great at probabilistic inference and calculation. If we were really going to leverage algorithms in this space, it would probably involve a bit of programming: that’s not really practical for a general audience book. As I can program, I intend to look into making tools for myself in this space.
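To give a flavour of what I mean, the skeleton of such a tool can be tiny (a generic randomised hill-climb of my own; `score` and `neighbours` are whatever your decision problem supplies):

    import random

    def hill_climb(score, neighbours, start, steps=1_000):
        """Accept random neighbouring tweaks that don't make things worse."""
        best = start
        for _ in range(steps):
            candidate = random.choice(neighbours(best))
            if score(candidate) >= score(best):
                best = candidate
        return best

Restart it from a few random starting points and keep the best answer: no guarantee of optimality, but often a ‘probably good’ solution in reasonable time.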


Game Theoretic Note

Let’s model this simply. You can either play a strategy of taking holiday or not. We model the rest of the company as a single agent taking a ‘high’ or ‘low’ holiday strategy. If we imagine that everyone has the same payoff matrix and everything falls into equilibrium, we can treat this as everyone taking holiday or not in unison.

The baseline is taking no holiday in a low holiday environment.

Compared to this, if you take no holiday in a high holiday environment, you get a payoff s, which represents increased likelihood of raises, promotions and so on.

If you take holiday in a low holiday environment, it costs you k. It could be the case that k = s, or even that k < s, though we probably imagine in most cases that k > s.

If you take holiday in a high holiday environment, you just get to enjoy the holiday! 😃 This is payoff h.

    You ↓ / Them →     | Low holiday | High holiday
    -----------------------------------------------
    Don't take holiday |      0      |      s
    Take holiday       |     -k      |      h

As discussed, not taking holiday dominates taking holiday if s > h. This leads to a bad equilibrium: one where no one takes any holiday.

What about if the CEO pays people to take holiday? Then he’s adding benefit b to every situation where you take holiday:

    You ↓ / Them →     | Low holiday | High holiday
    -----------------------------------------------
    Don't take holiday |      0      |      s
    Take holiday       |     b-k     |     b+h

So what are the cases here? Well, if b + h > s, or if b - k > 0, then not taking holiday no longer dominates.

If b - k > 0 but b + h < s, then there is no longer any pure-strategy equilibrium at all! That is, when no one is taking holiday, you’re happy to take it; but as soon as everyone is, it pays to defect. This turns the game into one more like poker, where you have to predict your opponents’ moves.

If b + h > s but b - k < 0, there are now two equilibria: one where everyone takes holiday and one where no one does. If you unilaterally take holiday here, it turns out badly for you. But if everyone is taking holiday, you just make yourself worse off by not taking any. Getting from the bad equilibrium to the good one is … difficult.

And if b + h > s and b - k > 0, then taking holiday becomes the dominant option! That would be nice: everyone would just take holiday.
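These cases are easy to check mechanically (my own sketch; the parameter values are made up to hit each regime):

    def equilibria(s, k, h, b):
        """Symmetric pure-strategy equilibria of the holiday game above."""
        # Payoff to you for (your move, everyone else's move);
        # 'N' = take no holiday, 'T' = take holiday.
        payoff = {('N', 'N'): 0, ('N', 'T'): s,
                  ('T', 'N'): b - k, ('T', 'T'): b + h}
        other = {'N': 'T', 'T': 'N'}
        # A move is an equilibrium if deviating unilaterally doesn't pay.
        return [m for m in 'NT' if payoff[(m, m)] >= payoff[(other[m], m)]]

    cases = [
        ("no-holiday equilibrium ", dict(s=4, k=2, h=1, b=0)),
        ("two equilibria         ", dict(s=4, k=5, h=1, b=4)),
        ("no pure equilibrium    ", dict(s=6, k=1, h=1, b=2)),
        ("holiday dominant       ", dict(s=4, k=2, h=1, b=6)),
    ]
    for label, params in cases:
        print(label, "->", equilibria(**params))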


1: I do plan to make use of the rules of thumb from the Bayes section, which I hadn’t heard of before. But they’re not really behaviour changes, and I haven’t made any use of them yet.