My Central Alignment Priority (2 July 2023)
Epistemic status: The further you scroll, the more important the points are.
I’m continuing the deconfusion from my alignment timeline post. Goal: figure out which AI alignment sub-problem to work on first.
Quick Notation
1 = steering
2 = goal
1.1k = no steering found
1.2k = poor steering found
1.3g = steering lets us do a pivotal act
1.4g = steering + existing goal inputs
1.5g = steering not needed, alignment by default
2.1k = no goal found, even with steering
2.2k = poor goal found, combined with steering
2.2g = goal is good/enough
(A “k” suffix marks a bad ending; a “g” suffix marks a good one.)
The Logic
Let’s say that my effort single-handedly changes the relative balance of research between 1 and 2; we can therefore ignore scenarios where my work doesn’t do anything. (By the same intuition, we ignore 1.5g, since it requires no effort.) A toy code restatement of the resulting tree appears after the logic below.
If I research 1 (first) and X happens, what happens?
- I find bad ideas I think are good. --> I overconfidently promote them, wasting time at best and causing 1.2k at worst.
- I find good ideas. --> 1 is solved --> 1.3g, 1.4g, or 2 must happen.
  - 1.3g: I, or a group I trust, must be able to do a pivotal act before 1’s solution leaks / is independently re-discovered.
  - 1.4g: A good ending is easy to achieve by quickly coding an implementation of 1’s solution.
  - 2: I, or a group I trust, must be able to solve 2 before 1’s solution leaks / is independently re-discovered.
If I research 2 (first) and X happens, what happens?
- I find bad ideas I think are good. --> I overconfidently promote them, wasting time at best and causing 2.2k at worst.
- I find good ideas. --> 2 is solved --> 1 must also be solved, for 2.2g to happen.
  - 1: I, or a group I trust, must be able to solve 1.
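To make the branching explicit, here is a toy Python restatement of the tree above. It is purely illustrative: the labels come from the Quick Notation, while the data structure and function names are mine and not load-bearing.

```python
# Toy restatement of the logic tree above. Labels come from the Quick Notation;
# the structure and names here are illustrative only.
from dataclasses import dataclass


@dataclass
class Branch:
    label: str     # Quick Notation label ("1.2k", "2.2g", ...) or remaining sub-problem ("2")
    requires: str  # what has to happen for this branch to be reached


def branches_if_first_researching(problem: int) -> list[Branch]:
    """Enumerate what can follow from putting my effort into problem 1 or problem 2 first."""
    if problem == 1:  # steering first
        return [
            Branch("1.2k", "my steering ideas are bad but I overconfidently promote them"),
            Branch("1.3g", "steering is solved AND a trusted group does a pivotal act "
                           "before the solution leaks / is re-discovered"),
            Branch("1.4g", "steering is solved AND existing goal inputs suffice, "
                           "so we just implement it"),
            Branch("2",    "steering is solved AND problem 2 must still be solved "
                           "before the solution leaks / is re-discovered"),
        ]
    # problem == 2: goal first
    return [
        Branch("2.2k", "my goal ideas are bad but I overconfidently promote them"),
        Branch("2.2g", "the goal is good/enough AND problem 1 also gets solved"),
    ]


if __name__ == "__main__":
    for p in (1, 2):
        print(f"If I research problem {p} first:")
        for b in branches_if_first_researching(p):
            print(f"  {b.label}: {b.requires}")
```

Running it just prints the two sides of the tree as listed above.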
The Bottom Line
I, personally, right now, should have the key research focus of “solve problem 2”.
If I get interesting ideas about problem 1, I should write them down privately and… well, I’m not sure, but probably not publish them quickly and openly.
This will be the case until/unless something happens that would make me change the logic above.
Some things that have not happened, but which, if they did happen, I would expect to change my mind on the above points:
- Problem 1 or problem 2 gets obviously-solved. --> Jump to the other respective branch of the logic tree.
- A global AI pause actually occurs, in a way that genuinely constrains Anthropic, OpenAI, DeepMind, and Meta such that AGI timelines can be pushed further out. --> Tentatively prioritize working on problem 1 more, due to its higher “inherent” difficulty compared to problem 2.
- Cyborgism succeeds so well that it becomes possible to augment the research abilities of AI alignment researchers. --> Drop everything, go get augmented, reevaluate the alignment situation (including the existence of the augmentations!) with my newfound brainpower.
- Some “new” Fundamental Fact comes to my attention that makes me redraw the game tree itself or have different Alignment Timeline beliefs. --> I redraw the tree and try again.
- I get feedback that my alignment work is unhelpful or actively counterproductive. --> I redraw the tree (if it’s something minor), or I stop doing technical alignment research (if it’s something serious and not-easily-fixable).
More footnotes about my Alignment Timeline specifically:
- The failure mode of a typical “capabilities-frontier” lab (OpenAI, Anthropic, DeepMind) is probably 1.1k, 1.2k, or 2.1k.
- As far as I know, Orthogonal is the only group devoting serious effort to problem 2. Therefore, my near-term focus (besides upskilling and getting a grant) is to assist their work on problem 2.
- Orthogonal’s failure mode is probably 2.2k. In that scenario, we/they develop a seemingly-good formal-goal, give it to a powerful/seed AI on purpose, turn it on, and then the goal turns out to be lethal.
- The components of my Timeline are orthogonal to many seemingly-“field-dividing” cruxes, including “scaling or algorithms?”, “ML or math?”, and “does future AI look more like ML or something else?”. I have somewhat-confident answers to these questions, and so do other people, but the weird part is that I think others’ answers are sometimes wrong, whereas they would think mine are either wrong or (at a first pass) mutually-exclusive.
For example, I’m clearly going for theoretical-leaning work like MIRI and especially Orthogonal, and I also think future superhuman AI will be extremely ML-based. Many people think “ML is the paradigm AND formal alignment is unhelpful”, or “ML is the wrong paradigm AND formal alignment is essential”.
I may write more about this, depending on whether it seems worth it and whether I have the time/energy.
NOTE: I used “goal”, “goals”, and “values” interchangeably in some writings, such as this one, and that was a mistake. A more consistent frame would be “steering vs. target-selection” (especially as per the Rocket Alignment analogy).