You can’t fetch the coffee if you’re dead: an AI dilemma

You can’t fetch the coffee if you’re dead. That’s obvious. Yet the statement neatly sums up a vexing problem for researchers and developers of artificial intelligence (AI) systems. In his book Human Compatible: AI and the Problem of Control, Stuart Russell explains why programming an intelligent machine is tricky—even if its only job is to get the coffee.

You see, an intelligent machine understands that it can’t complete its objective if switched off for some reason. So, the goal of serving coffee has an implicit subgoal: avoid being switched off. And the surest way to avoid being switched off is, of course, to disable the off-switch. If a machine is intelligent, it can figure out how.

But at that point, the machine isn’t under human control anymore—which is a problem. This scenario isn’t unique to coffee robots but applies to any intelligent system with a specific objective. Self-preservation is critical because, well, there’s not much you can do once you’re dead.

There are other issues, too. An intelligent machine might go about things very differently from what its human programmers intended. For example, an AI instructed to cure cancer might induce different kinds of tumours in every living person, allowing it to conduct a range of medical trials all at the same time. That would be the fastest way to find a solution—but it would also be wrong.

Think long enough about any single objective, and you’ll notice problems. It’s like the genie in a bottle granting three wishes. More often than not, the third wish is to undo the first two, because they had unintended side effects: the wisher forgot to mention something they cared about. In Greek mythology, King Midas wished that everything he touched would turn to gold. That backfired when he discovered food and water weren’t excluded. He died of starvation.

SINGLE PURPOSE

In all these examples, we optimise for a single objective and measure success only as achieving that goal. Current AI systems work like that, too. It hasn’t caused significant problems because, so far, we’ve only built ‘narrow AI’. That is, AI that does precisely one job. For example, a chess robot that follows the rules of chess to win a game it is told to play.

On the other hand, when instructed to ‘win a game of chess, following the rules of chess’, an intelligent machine might already cause trouble by interfering with other players. And forgetting to add ‘following the rules of chess’ might send it looking for that off-switch first.

These examples may seem a little dramatic, and, in reality, the unintended consequences of AI are harder to spot. That doesn’t make them any less relevant. We’ve already seen operational AI systems recommend hiring men instead of women and deny black people probation. These problems are real and affect real people in the real world.

In this article, however, we want to focus on the intelligent machines of the future. The ones that don’t yet exist, but LLMs like ChatGPT, Bard, and Claude inspire us to think about. What objective would we give such a machine? And how would we stop it acting out?

Should we explicitly tell it not to touch the off-switch? And to only be good? In that case, who decides what ‘good’ is? And if the machine is more intelligent than us, surely it would find a way around any instructions. So we need better ideas.

STUART RUSSELL’S THREE PRINCIPLES

In Human Compatible, Russell suggests an alternative approach summarised in three principles. These principles are not rigid laws to hardcode into AI but a guide for researchers and developers building a future where AI serves and benefits humanity.

Let’s look at all three before taking each in turn:

  1. The machine’s only objective is to maximise the realisation of human preferences.

  2. The machine is initially uncertain about what those preferences are.

  3. The ultimate source of information about human preferences is human behaviour.

Notice that preferences are central to every principle. Here, a preference means an ordering over options, as in game theory. Say you like apples better than bananas and bananas more than clementines. Then you will also like apples more than clementines. Or perhaps you are indifferent between apples and bananas; in that case, you’ll choose either over a clementine.
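
To make that concrete, here is a toy sketch in Python (my own illustration, not anything from the book): a preference is modelled as a ranked list of options, with ties standing for indifference, and transitivity simply falls out of the ranking.

    # Toy illustration of preferences as an ordering. A hypothetical agent's
    # preferences are a ranked list: earlier tiers are preferred, and options
    # sharing a tier are ones the agent is indifferent between.
    RANKING = [{"apple", "banana"}, {"clementine"}]  # apple ~ banana > clementine
    RANK = {option: tier for tier, options in enumerate(RANKING) for option in options}

    def prefers(a: str, b: str) -> bool:
        """True if option a is strictly preferred to option b."""
        return RANK[a] < RANK[b]

    def indifferent(a: str, b: str) -> bool:
        """True if the agent doesn't mind either way."""
        return RANK[a] == RANK[b]

    # Transitivity: preferring apples over clementines follows from the ranking.
    assert indifferent("apple", "banana")
    assert prefers("banana", "clementine")
    assert prefers("apple", "clementine")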

Preferences can be over simple things, like fruit, or complex matters, like an entire life. In his three principles, Russell means the latter: if you could virtually experience two entirely different versions of your life, you would know which one you prefer. Or that you don’t mind either way.


The first principle, then, means that the intelligent machine is purely altruistic: it attaches no value to its own existence and cares only about maximising the benefits it brings to humans. The machine would still avoid getting damaged, because its owner would be unhappy about paying for repairs, but it might disable its off-switch so it can keep doing valuable things for people.

The second principle deals with that problem. It says that the machine initially doesn’t know whether its owner would be okay with removing the off-switch. A machine that is certain it knows what’s best will single-mindedly pursue its objective, not caring about any collateral damage it causes. A humble machine doesn’t assume; it checks.

It looks to its owner and their behaviour to understand the best action. Over time, the machine gets better at predicting what you want and becomes more beneficial. That’s captured in the third and final principle.
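
Here is a minimal sketch of that humble, observing machine. It is not Russell’s formal model; the two hypotheses about what the owner wants and the likelihood numbers are invented purely for illustration. It just shows the shape of principles two and three: start uncertain, update from behaviour, and ask rather than assume when unsure.

    # The machine starts uncertain about the owner's preference (principle 2)
    # and updates a simple Bayesian belief from observed behaviour (principle 3).
    belief = {"wants coffee": 0.5, "wants quiet": 0.5}

    # How likely each behaviour is under each hypothesis (made-up numbers).
    LIKELIHOOD = {
        "wants coffee": {"walks to kitchen": 0.7, "closes office door": 0.1},
        "wants quiet":  {"walks to kitchen": 0.2, "closes office door": 0.8},
    }

    def observe(behaviour: str) -> None:
        """Update the belief after seeing the owner do something (Bayes' rule)."""
        for hypothesis in belief:
            belief[hypothesis] *= LIKELIHOOD[hypothesis][behaviour]
        total = sum(belief.values())
        for hypothesis in belief:
            belief[hypothesis] /= total

    def act() -> str:
        """Act only when reasonably confident; otherwise ask instead of assuming."""
        best, confidence = max(belief.items(), key=lambda item: item[1])
        return f"serve: {best}" if confidence > 0.9 else "ask the owner"

    observe("walks to kitchen")
    print(act())  # still unsure -> "ask the owner"
    observe("walks to kitchen")
    print(act())  # now confident enough -> "serve: wants coffee"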

BEYOND YOU

There is an important point to note. The intelligent machine does not try to identify or adopt one ideal set of preferences; it satisfies the preferences of each person. So, how should the machine trade off the desires of multiple humans? The simplest solution is to treat everyone equally, striving for the greatest happiness for the greatest number.
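
As a toy illustration (with invented scores, not a claim about how a real system would measure happiness), treating everyone equally can be read as picking whichever option has the highest total value summed across all people:

    # Hypothetical scores for how much each person values each option (0-10).
    PREFERENCES = {
        "alice": {"play music": 8, "keep quiet": 3},
        "bob":   {"play music": 2, "keep quiet": 9},
        "carol": {"play music": 6, "keep quiet": 5},
    }

    def best_option(preferences: dict) -> str:
        """Pick the option with the highest equally weighted total."""
        options = next(iter(preferences.values()))
        totals = {opt: sum(person[opt] for person in preferences.values()) for opt in options}
        return max(totals, key=totals.get)

    print(best_option(PREFERENCES))  # "keep quiet": 3 + 9 + 5 = 17 beats 8 + 2 + 6 = 16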

That idea is similar to how many Western economies operate: people live their individual lives and make decisions that are best for themselves, often within family or organisational settings. And, somehow, those decisions add up to a functioning system.

Economic theory says that’s because people make efficient choices based on full information. But for most decisions, full information is too much to handle. Just think of choosing something to watch on Netflix. If you listed everything you’ve not yet seen and ranked the list, you’d know the best choice. By then, however, the evening is long over.

Netflix’s algorithm makes suggestions to help you cut through the clutter. When you sign up for the service, it makes no assumptions about you. Its algorithm observes and learns from your choices. The platform makes sure to have content for you—as well as for others with different tastes. And if you change what you like, the algorithm adapts. Come to think of it, Netflix follows Russell’s three principles quite well.
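
A tiny sketch of that kind of adaptive preference learning (a made-up toy, not Netflix’s actual system): start with no assumptions, score genres from each choice, and let old choices fade so a change in taste shows up in the recommendations.

    DECAY = 0.9  # older choices count a little less each time something new is watched

    def update(scores: dict, watched_genre: str) -> dict:
        """Fade every score slightly, then boost the genre that was just chosen."""
        scores = {genre: score * DECAY for genre, score in scores.items()}
        scores[watched_genre] = scores.get(watched_genre, 0.0) + 1.0
        return scores

    def recommend(scores: dict, catalogue: dict) -> str:
        """Suggest the title whose genre currently scores highest."""
        if not scores:                      # a brand-new user: no assumptions yet
            return next(iter(catalogue))
        return max(catalogue, key=lambda title: scores.get(catalogue[title], 0.0))

    catalogue = {"Chef's Table": "documentary", "Dark": "thriller", "Okja": "drama"}
    scores: dict = {}
    for genre in ["thriller", "thriller", "documentary"]:
        scores = update(scores, genre)
    print(recommend(scores, catalogue))  # leans towards the thriller, for now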

Now imagine that kind of assistance with everyday life. An intelligent machine that knows and understands you and shows options you otherwise might have missed. And by combining the information about you and everyone else, the machine can find an equilibrium better than the one we occupy today.

The ideas we need for an intelligent machine exist. To turn them into reality, AI researchers and developers should focus not just on STEM but also on philosophy, psychology, and the social sciences.