My Central Alignment Priority (2 July 2023)
Epistemic status: The further you scroll, the more important the points are.
I’m continuing the deconfusion from my alignment timeline post. Goal: figure out which AI alignment sub-problem to work on first.
Quick Notation
1 = steering
2 = goal
1.1k = no steering found
1.2k = poor steering found
1.3g = steering lets us do a pivotal act
1.4g = steering + existing goal inputs
1.5g = steering not needed, alignment by default
2.1k = no goal found, even with steering
2.2k = poor goal found, combined with steering
2.2g = goal is good/enough
(A “k” suffix marks a bad ending; a “g” suffix marks a good one.)
The Logic
Let’s say that my effort single-handedly changes the relative balance of research between 1 and 2; we can therefore ignore scenarios where my work doesn’t do anything. (By the same intuition, we ignore 1.5g, since it requires no effort.) A toy code restatement of the resulting tree appears after the logic below.
If I research 1 (first) and X happens, what happens?
- I find bad ideas I think are good. --> I overconfidently promote them, wasting time at best and causing 1.2k at worst.
- I find good ideas. --> 1 is solved --> 1.3g, 1.4g, or 2 must happen.
  - 1.3g: I, or a group I trust, must be able to do a pivotal act before 1’s solution leaks / is independently re-discovered.
  - 1.4g: A good ending is easy to achieve by quickly coding an implementation of 1’s solution.
  - 2: I, or a group I trust, must be able to solve 2 before 1’s solution leaks / is independently re-discovered.
If I research 2 (first) and X happens, what happens?
- I find bad ideas I think are good. --> I overconfidently promote them, wasting time at best and causing 2.2k at worst.
- I find good ideas. --> 2 is solved --> 1 must also be solved, for 2.2g to happen.
  - 1: I, or a group I trust, must be able to solve 1.
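To make the branching explicit, here is a toy Python restatement of the tree above. It is purely illustrative: the labels come from the Quick Notation, while the data structure and function names are mine and not load-bearing.

```python
# Toy restatement of the logic tree above. Labels come from the Quick Notation;
# the structure and names here are illustrative only.
from dataclasses import dataclass


@dataclass
class Branch:
    label: str     # Quick Notation label ("1.2k", "2.2g", ...) or remaining sub-problem ("2")
    requires: str  # what has to happen for this branch to be reached


def branches_if_first_researching(problem: int) -> list[Branch]:
    """Enumerate what can follow from putting my effort into problem 1 or problem 2 first."""
    if problem == 1:  # steering first
        return [
            Branch("1.2k", "my steering ideas are bad but I overconfidently promote them"),
            Branch("1.3g", "steering is solved AND a trusted group does a pivotal act "
                           "before the solution leaks / is re-discovered"),
            Branch("1.4g", "steering is solved AND existing goal inputs suffice, "
                           "so we just implement it"),
            Branch("2",    "steering is solved AND problem 2 must still be solved "
                           "before the solution leaks / is re-discovered"),
        ]
    # problem == 2: goal first
    return [
        Branch("2.2k", "my goal ideas are bad but I overconfidently promote them"),
        Branch("2.2g", "the goal is good/enough AND problem 1 also gets solved"),
    ]


if __name__ == "__main__":
    for p in (1, 2):
        print(f"If I research problem {p} first:")
        for b in branches_if_first_researching(p):
            print(f"  {b.label}: {b.requires}")
```

Running it just prints the two sides of the tree as listed above.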
The Bottom Line
I, personally, right now, should have the key research focus of “solve problem 2”.
If I get interesting ideas about problem 1, I should write them down privately and… well, I’m not sure, but probably not publish them quickly and openly.
This will be the case until/unless something happens that would make me change the logic above.
Some things that have not happened, but which, if they did happen, I would expect to change my mind on the above points:
- Problem 1 or problem 2 gets obviously-solved. --> Jump to the other respective branch of the logic tree.
- A global AI pause actually occurs, in a way that genuinely constrains Anthropic, OpenAI, DeepMind, and Meta such that AGI timelines can be pushed further out. --> Tentatively prioritize working on problem 1 more, due to its higher “inherent” difficulty compared to problem 2.
- Cyborgism succeeds so well that it becomes possible to augment the research abilities of AI alignment researchers. --> Drop everything, go get augmented, reevaluate the alignment situation (including the existence of the augmentations!) with my newfound brainpower.
- Some “new” Fundamental Fact comes to my attention that makes me redraw the game tree itself or have different Alignment Timeline beliefs. --> I redraw the tree and try again.
- I get feedback that my alignment work is unhelpful or actively counterproductive. --> I redraw the tree (if it’s something minor), or I stop doing technical alignment research (if it’s something serious and not-easily-fixable).
More footnotes about my Alignment Timeline specifically:
- The failure mode of a typical “capabilities-frontier” lab (OpenAI, Anthropic, DeepMind) is probably 1.1k, 1.2k, or 2.1k.
- As far as I know, Orthogonal is the only group devoting serious effort to problem 2. Therefore, my near-term focus (besides upskilling and getting a grant) is to assist their work on problem 2.
- Orthogonal’s failure mode is probably 2.2k. In that scenario, we/they develop a seemingly-good formal-goal, give it to a powerful/seed AI on purpose, turn it on, and then the goal turns out to be lethal.
- The components of my Timeline are orthogonal to many seemingly-“field-dividing” cruxes, including “scaling or algorithms?”, “ML or math?”, and “does future AI look more like ML or something else?”. I have somewhat-confident answers to these questions, and so do other people, but the weird part is that I think others’ answers are sometimes wrong, whereas they would think mine are either wrong or (at a first pass) mutually-exclusive.
For example, I’m clearly going for theoretical-leaning work like MIRI and especially Orthogonal, and I also think future superhuman AI will be extremely ML-based. Many people think “ML is the paradigm AND formal alignment is unhelpful”, or “ML is the wrong paradigm AND formal alignment is essential”.
I may write more about this, depending on whether it seems worth it and whether I have the time/energy.
NOTE: I used “goal”, “goals”, and “values” interchangeably in some writings, such as this one, and that was a mistake. A more consistent frame would be “steering vs. target-selection” (especially as per the Rocket Alignment analogy).