What I Was Thinking About Before Alignment

People frequently ask me about my backstory: how I got into alignment/agency research, what I did before, that sort of thing. One of the main things I emphasize is that, before alignment, I was thinking about analogous problems in economics and especially in biology, and I think the view from that angle made it much more obvious where AI alignment was going to run into the same major barriers.

Below is an essay I wrote in summer 2017, arguing that understanding foundational problems of agency is the primary bottleneck to progress in a wide variety of scientific fields. Hopefully this will give some idea of where my views on alignment/agency research stem from.

The Scientific Bottleneck

Imagine you’re in a sci-fi universe in the style of Star Trek or Stargate or the like. You’ve bumped into a new alien species, drama ensued, and now you’re on their ship and need to hack into their computer system. Actually, to simplify the discussion, let’s say you’re the aliens, and you’re hacking into the humans’ computer system.

Let’s review just how difficult this problem is.

You’re looking at billions of tiny electronic wires and switches and capacitors. You have a rough idea of the high-level behavior they produce—controlling the ship, navigating via the stars, routing communications, etc. But you need to figure out how that behavior is built up out of wires and switches and electronic pulses and whatnot. As a first step, you’ll probably scan the whole CPU and produce a giant map of all the switches and wires and maybe even run a simulation of the system. But this doesn’t really get you any closer to understanding the system or, more to the point, any closer to hacking it.

So how can we really understand the computer system? Well, you’ll probably notice pretty quickly that there’s regular patterns on the CPU. At the low level, there’s things like wires and switches. You might also measure the voltages in those wires and switches, and notice that the exact voltage level doesn’t matter much; there’s high voltages and low voltages, and the exact details don’t seem to matter once you know whether it’s high or low. Then you might notice some higher-level structures, patterns of wires and switches which form other standard elements, like memory elements and logic gates. But eventually, you’re going to exhaust the “hardware” properties, and you’ll need to start mapping “software”. That problem will be even harder: you’ll basically be doing reverse compilation, except you’ll need to reverse compile the operating system at the same time as the programs running on it, and without knowing what language(s) any of those programs were written in.

That’s basically the state of biology research today.

There’s millions of researchers poking at this molecule or that molecule, building very detailed pictures of small pieces of the circuitry of living organisms. But we don’t seem much closer to decoding the higher-level language. We don’t seem any closer to assigning meaning to the signals propagating around in the code of living organisms.

Of course, part of the problem is that organisms weren’t written in any higher-level language. They were evolved. It’s not clear that it’s possible to assign meaning to a single molecular signal in a cell, any more than you could assign meaning to a single electron in a circuit. There certainly is meaning somewhere in the mess: organisms model their environments, so the information they’re using is in there somewhere. But it’s not obvious how to decode that information.

All that said, biologists have a major advantage over aliens trying to hack human computer systems: software written by humans is *terrible*. (Insert obligatory Java reference here.) Sure, there’s lots of abstraction levels, lots of patterns to find, but there’s no universal guiding principle.

Organisms, on the other hand, all came about by evolution. That means they’re a mad hodgepodge of random bits and pieces, but it also means that every single piece in that hodgepodge is *optimized*. Every single piece has been tweaked toward the same end goal.

The Problem: General

There’s a more general name for systems which arise by optimization: adaptive systems. Typical examples include biological organisms, economic/financial systems, the brain, and machine learning/AI systems.

Each of these fields faces the same fundamental problem as biology: we have loads of data on the individual components of a big, complicated system. Maybe it’s protein expression and signalling in organisms, maybe it’s financial data on individual assets in an economy, maybe it’s connectivity and firing data on neurons in a brain, maybe it’s parameters in a neural network. In each case, we know that the system somehow processes information into a model of the world around it, and acts on that model. In some cases, we even know the exact utility function (as in machine learning, where we specify the training objective ourselves). But we don’t have a good way to back out the system’s internal model.

What we need is some sort of universal translator: a way to take in protein expression data or neuron connectivity or what have you, and translate it into a human-readable description of the system’s internal model of the world.

Note that this is fundamentally a theory problem. The limiting factor is not insufficient data or insufficient computing power. Google throws tremendous amounts of data and computational resources into training neural networks, but decoding the internal models used by those networks? We lack the mathematical tools to even know where to start.

Bottleneck

A while ago I wrote a post on the hierarchy of the sciences, featuring this diagram:

Yeah, I know, it’s kinda cheesy. It was five years ago, ok?

The dotted line is what I called the “real science and engineering frontier”. The fields within the line are built on robust experiments and quantitative theory. Their foundations and core principles are well-understood, enough that engineering disciplines have been built on top of them. The fields outside have not yet reached that point. Fields right on the frontier or just outside are exciting places to be—these are the fields which are, right now, crossing the line from crude experiments and incomplete theories to robust, quantitative sciences.

What’s really interesting is that the fields on or just outside the frontier—biology, AI, economics, and psychology—are exactly the fields which study adaptive systems. And they are all stuck on qualitatively similar problems: decoding the internal models of complex systems.

This suggests that the lack of mathematical tools for decoding adaptive systems is the major bottleneck limiting scientific progress today.

Removing that bottleneck—developing useful theory for decoding adaptive systems—would unblock progress in at least four fields. It would revolutionize AI and biology almost overnight, and economics and psychology would likely see major advances shortly thereafter.

Questions

Let’s make the problem a little more concrete. Here are a few questions which a solid theory of adaptive systems should be able to answer.

  • How can we recognize adaptive systems in the wild? What universal behaviors indicate an adaptive optimizer?

  • There are already strong theoretical reasons to believe that any adaptive system which predicts effectively has learned to approximate some Bayesian model; the history of machine learning provides plenty of evidence supporting the theory as well. Given a fully specified adaptive system, e.g. a trained neural network, how can we back out the Bayesian model which it approximates?

  • Bayesian models are constrained by the rules of probability, but we can also add the rules of causality. How can we tell when an adaptive system (e.g. a neural net) has learned to approximate a causal model, and how can we back out that model?

  • Outside of machine learning/AI, utility functions are generally unknown. We know that e.g. a bacterium has evolved to maximize evolutionary fitness, but how can we estimate the shape of the fitness function based on parameters of the optimized system?

  • Under what conditions will an adaptive system learn models with levels of abstraction? How can those abstractions be translated into something human-readable?

  • Once the fitness function and internal models used by a bacterium have been decoded, how can new information or objectives be passed back into the cell via chemical concentrations or genetic modification? More generally, how can human-readable information (including probabilities, causal relationships, utility, and abstractions) be translated back into the parameter space of an adaptive system?

Obviously this list is just a start, but it captures the flavor of the major problems.
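
To give a feel for the second question above (backing out the Bayesian model which a system approximates), here is a toy Python sketch of about the simplest case I can think of. Everything in it is an assumption made for illustration: the "adaptive system" is a black box which predicts coin flips, we assume it behaves exactly like a Beta-Bernoulli model with some hidden prior Beta(a, b), and the names make_black_box and back_out_beta_prior are hypothetical. Under those assumptions the box's prediction after seeing h heads and t tails is (a + h) / (a + b + h + t), so two queries are enough to solve for a and b.

```python
from typing import Callable, Tuple

# A predictor maps (heads seen, tails seen) -> P(next flip is heads).
Predictor = Callable[[int, int], float]


def make_black_box(a: float, b: float) -> Predictor:
    """Stand-in for a trained system (hypothetical).

    Internally it is a Beta-Bernoulli model with prior Beta(a, b),
    but callers only get to query its predictions.
    """
    def predict(heads: int, tails: int) -> float:
        return (a + heads) / (a + b + heads + tails)

    return predict


def back_out_beta_prior(predict: Predictor) -> Tuple[float, float]:
    """Recover (a, b), assuming predict is exactly Beta-Bernoulli.

    Let s = a + b. Then:
        predict(0, 0) = a / s              (prior mean, call it m)
        predict(1, 0) = (a + 1) / (s + 1)  (call it p1)
    Solving these two equations gives s = (1 - p1) / (p1 - m).
    """
    m = predict(0, 0)
    p1 = predict(1, 0)
    if p1 <= m:
        # A Beta-Bernoulli model with finite a + b always moves toward
        # heads after seeing one head, so this box violates the assumption.
        raise ValueError("predictions are not consistent with a Beta prior")
    s = (1.0 - p1) / (p1 - m)
    a = m * s
    return a, s - a


if __name__ == "__main__":
    box = make_black_box(2.0, 5.0)
    print(back_out_beta_prior(box))
```

The real problem is hard because none of these assumptions hold for something like a trained neural network: we don't know the model family in advance, the approximation is not exact, and the system's "predictions" are spread across millions of parameters rather than exposed through a clean query interface.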