My Alignment Timeline

Epistemic status: If I don’t post this now, I’ll have a harder time treating it as something to build on.

Which chapter do we end on?

Which chapter of doomsday are you on, right now?

I think of AI-alignment-going-well as a process, whereas AI-causes-human-extinction-or-worse is an event that will, by default, interrupt (and end) that process.

Many cruxes about AI alignment subproblems reduce to “Which part do you think is The Part That Kills Us?”.

Consider the following game tree:

  1. Find a way to steer AI at all, as it reaches high capability levels.

    1. Part That Kills Us: We don’t find this.

    2. Part That Kills Us: We find something that looks like this, and implement it, but it actually doesn’t survive the sharp left turn.

    3. Good Ending: Steerability solves most of the problem, and we can then do a pivotal act.

    4. Good Ending: The capabilities “sharp left turn” doesn’t break our steering, and we can just apply existing human-goal-input-methodologies to steerable AI.

    5. Good Ending: We don’t need better “steering”; we get alignment by default, modulo some non-theoretical engineering advances.

  2. Once AI is steerable, give it goals (or goal-replacing structures) that, at least, don’t involve extinction- or suffering-risks, and hopefully lead to really good outcomes.

    1. Part That Kills Us: We don’t do this, even given a steerable AI.

    2. Part That Kills Us: We find something that looks like this, and implement it, but it actually causes human extinction/suffering-risks instead.

    3. Good Ending: The goal is good, or at least good enough.

So a “Part That Kills Us” is a leaf on the game tree that ends badly, while a “Good Ending” is a leaf that avoids ending badly.
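
If it helps to keep the numbered branches straight (the rest of the post refers to them as 1.3, 1.5, 2.1, and so on), here is the same tree written out as a toy Python structure. The representation and names are arbitrary and purely for reference; it adds nothing beyond the list above.

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    label: str    # branch label used in the text, e.g. "1.3"
    kind: str     # "Part That Kills Us" or "Good Ending"
    summary: str  # short paraphrase of the branch

GAME_TREE = {
    "1. Find a way to steer AI at all": [
        Leaf("1.1", "Part That Kills Us", "We don't find a way to steer AI."),
        Leaf("1.2", "Part That Kills Us", "Our steering doesn't survive the sharp left turn."),
        Leaf("1.3", "Good Ending", "Steerability solves most of the problem; we do a pivotal act."),
        Leaf("1.4", "Good Ending", "The sharp left turn doesn't break our steering."),
        Leaf("1.5", "Good Ending", "Alignment by default, modulo mere engineering."),
    ],
    "2. Give steerable AI goals that avoid extinction- and suffering-risks": [
        Leaf("2.1", "Part That Kills Us", "We don't give it such goals, even though we could."),
        Leaf("2.2", "Part That Kills Us", "The goals we do give cause extinction/suffering instead."),
        Leaf("2.3", "Good Ending", "The goal is good, or at least good enough."),
    ],
}

# The leaves that end badly are exactly the "Part That Kills Us" branches:
bad_ends = [leaf.label
            for leaves in GAME_TREE.values()
            for leaf in leaves
            if leaf.kind == "Part That Kills Us"]
print(bad_ends)  # ['1.1', '1.2', '2.1', '2.2']
```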

The good news: a non-zero number of people seem to be working on each individual step in this game tree.

The bad news: they’ve each tied their substep to the rest of their beliefs about AI, and this has broken the field.

Worlds Where Some Chapters Don’t Matter

I want to clarify some relationships among the branches of this game tree. This should help deconfuse me, and maybe other people too.

  1. In Good Ending 1.5 (alignment by default; it appears as “1.e” on LessWrong), further AI alignment work is not needed. Moreover, humanity is already pretty good at such “mere engineering” issues. The worlds where we should worry most about AI alignment are the worlds where mere engineering doesn’t work. This, in turn, implies that AI alignment should focus on solving the problems that arise in worlds where “normal efforts” fail.

    Do technical, and especially theoretical, AI alignment work, in case we’re in one of the branches with a [Part That Kills Us, absent a theoretical breakthrough].

  2. If we solve problem 1, there are three further paths: 1.3, 1.4, and needing to solve problem 2. However, we could reasonably plan for the scenario where, shortly after a solution to problem 1 is announced, we die because we didn’t solve problem 2. This, in turn, implies that the marginal additional AI alignment researcher (e.g., me, the author) should focus on solving problem 2 instead of problem 1, if they think 2.1 is the likely Part That Kills Us. That way, a solution to problem 2 can be quickly “loaded into” [the implementation of [the solution to problem 1]].

    Have the goal ready, in case [solving steering quickly, without a goal to load into it] leads to a Part That Kills Us.

  3. For any situation that gives us a Good Ending, some [research implied/required by other branches] becomes superfluous. E.g., if we solve problem 1 and then do a pivotal act as per 1.3, we’d have bought plenty of time (and reduced extinction risk enough) to spend more time solving problem 2. Therefore, current-day researchers should focus on problem 1 instead of problem 2, if they think 1.3 is a likely Good Ending.

    This is of special importance to me because, as noted elsewhere, I’m trying to figure out what I can personally best do for AI alignment. If the answer is “something technical”, I’m still at risk of wasting time, or causing problems, if I pick the wrong subfield of alignment to do research in.
