My understanding of the alignment problem

I’ve been clarifying my own understanding of the alignment problem over the past few months, and wanted to share my first writeups with folks here in case they’re useful:

https://​​www.danieldewey.net/​​risk/​​

The site currently has 3 pages:

  1. The case for risk: how deep learning could become very influential, training problems that could lead models to behave in systematically harmful ways, and what I think we should do about it. Inspired mainly by What failure looks like.

  2. Fermi estimate of future training runs: a short AI timelines estimate inspired by Forecasting transformative AI.

  3. Applications of high-capability models: some notes on how high-capability models could actually be trained, and how their behavior could become highly influential.

None of the ideas on the site are particularly new, and as I note, they’re not consensus views, but the version of the basic case I lay out on the site is very short, doesn’t have a lot of outside dependencies, and is put together out of nuts-and-bolts arguments that I think will be useful as a starting point for alignment work. I’m particularly hoping to avoid semantic arguments about “what counts as” inner vs outer alignment, optimization, agency, etc., in favor of more mechanical statements of how models could behave in different situations.

I think some readers on this forum will already have been thinking about alignment this way, and won’t get a lot new out of the site; some (like me) will find it to be a helpful distillation of some of the major arguments that have come out over the past ~5 years; and some will have disagreements (which I’m curious to hear about).

I thought about posting all of this directly on the Alignment Forum /​ LessWrong, but ultimately decided I wanted a dedicated home for these ideas.


Out of everything on the site, the part I’m most hoping will be helpful to you is my (re)statement of two main problems in AI alignment. These map roughly onto outer and inner alignment, though different people use those terms differently, so not everyone will agree:

As models become more capable, it looks like currently known training methods will run into fundamental safety problems, and become increasingly likely to produce models that behave in systematically harmful ways:

1. Evaluation breakdown: As a model’s behavior becomes more sophisticated, it will reach a point where an automated reward function or human evaluator will not be able to fully understand its behavior. In many domains, it will then become possible for models to get good evaluations by producing adversarial behaviors that systematically hide bad outcomes, maximize the appearance of good outcomes, and generally seek to control the information flowing to the evaluator instead of achieving the desired results.

Evaluation breakdown would produce high-capability models that appear to work as intended, but that will behave in arbitrarily harmful ways when that behavior is useful for producing good evaluations; this would be broadly analogous to a company using its advantages in resources, personnel, and specialized knowledge to keep regulators and the public in the dark about harms.

2. High-level distribution shift: Even if evaluation breakdown is avoided, a model may behave arbitrarily badly when its input distribution is different from its training distribution. Especially harmful behavior could occur under “high-level” distribution shifts – shifts that leave the low-level structure of the domain unchanged (e.g. causal patterns that allow prediction of future observations or consequences of actions), but change some high-level features of the broader situation the model is operating in. Since the basic structure of the domain is unchanged, a model could continue to behave competently in the new distribution, but its behavior could be arbitrarily different from what it was intended to do.

In practice, a model that is vulnerable to high-level distributional shift would perform well in many situations, but have some chance of behaving in systematically harmful ways when conditions change. For example, high-level distribution shift might cause a model to switch to harmful behavior in new situations (e.g. committing fraud when it becomes possible to get away with it, manipulating a country’s political process when the model gains access to the required resources, or creating an addictive product when the required technology is developed); or a model might continue to pursue proxies of good performance in situations where they are no longer appropriate (e.g. continuing to maximize a company’s profit and growth during national emergencies, or continuing to maximize sales when it becomes apparent that a product is harmful).


What’s next? Ultimately, I’m hoping to figure out what kinds of research projects are most likely to produce forward progress towards training methods that avoid evaluation breakdown and high-level distribution shift. A world where we’re making clear year-over-year progress towards these goals looks achievable to me.