OpenAI’s Alignment Plan is not S.M.A.R.T.

In response to Eliezer Yudkowsky’s challenge, I will show that the alignment research approach outlined by OpenAI lacks common desiderata for effective plans. Most of the deficiencies appear difficult or impossible to fix, and we should therefore expect the plan to fail.

Meta-level descriptions of what makes a plan good place great emphasis on the plan’s goals and objectives. George T. Doran suggests that goals be S.M.A.R.T.: Specific, Measurable, Achievable, Relevant, and Time-Bound.

Specific: OpenAI’s description of the goal, an AI that is “Value aligned” and “Follow[s] human intent”, could be elaborated in much greater detail than these 5 words. Yet making the goal specific is no easy task: No definition of these terms exists in sufficient detail to be put into computer code, nor does an informal consensus exist.

Measurable: There currently exists no good way to quantify value alignment or intent-following. It is an open question whether such quantification is even possible in an adequate way, and OpenAI does not seem to focus on resolving the philosophical issues required to make value alignment measurable.

Achievable: The plan suggests a relatively narrow AI would be sufficient to contribute to alignment research, while being too narrow to be dangerous. This seems implausible: Much easier problems than alignment research have been called AGI-complete, and general reasoning ability is widely thought to be a requirement for doing research.

Relevant: The plan acknowledges existential risk from advanced AI, but the proposed goal is insufficient to end the period of acute risk. This gap in the plan must be closed, and I do not think doing so is trivial, as OpenAI rejects MIRI-style pivotal acts. My impression is that OpenAI hopes alignment can be solved to the extent that the “Alignment tax” becomes so overwhelmingly negative that deceptively aligned AI is not built in practice by anyone.
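
To make the “Alignment tax” framing concrete, here is a rough formalization. This is my own gloss, not a definition taken from OpenAI’s document, and the cost function is deliberately left abstract:

$$
\text{alignment tax} \;=\; \text{Cost}(\text{best aligned system}) \;-\; \text{Cost}(\text{best comparable unaligned system})
$$

A negative tax means the aligned system is strictly cheaper or more capable than the unaligned alternative, so no developer has an incentive to skip alignment. The plan implicitly needs this inequality to hold, by an overwhelming margin, for every relevant actor.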

Time-bound: The plan is not time-bound. OpenAI’s plan is conceptualized as 3 pillars (Reinforcement Learning from Human Feedback, AI-assisted evaluation, and AI doing alignment research), but the dependencies between them mean the work will have to proceed as 3 partially overlapping phases. The plan gives no timing or criteria for progressing from one phase to the next: When does OpenAI intend to pivot towards the last phase, AI-based alignment research?

Other problems

Robustness: The 3 steps outlined in the plan have the property that steps 1 and 2 actively push humanity closer to extinction, and only if step 3 succeeds will the damage be undone. “Making things worse to make them better” is sometimes necessary, but it should not be done blindly. I suspect OpenAI disagrees, but I do not know what their objection would be.

Resource allocation: This is a central part of good plans in general, yet it is absent from the document except for a few scattered remarks about underinvestment in robustness and interpretability. To the extent that the problem is keeping alignment ahead of capability (“Two progress bars”), resource allocation is crucial. The departure of key personnel from OpenAI’s alignment team suggests that OpenAI is putting more resources into capability than into alignment.

Risk analysis: OpenAI acknowledges that the least capable AI able to do alignment research may turn out to be capable enough to be dangerous. This is described as a limitation but not discussed further. A better plan would analyze such weak points in detail.

Review: The plan calls for OpenAI to be transparent about how well its alignment techniques actually work in practice. From the outside, it is unclear whether the plan is on track. The launch of ChatGPT seems not to have gone the way OpenAI expected, but the actual results and OpenAI’s evaluation of them have not been published (yet).

Thoughts

The challenge was not framed as a request for a defensible, impartial analysis; we were asked for our thoughts. The thoughts I present below are honestly held, but they are derived more from intuition than from rigorous analysis.

Having “a realistic plan for solving alignment” is a high bar, and OpenAI is far from meeting it. No one else can meet it either, but “reality doesn’t grade on a curve”: Either we pass the inflexible criteria or we die. OpenAI’s recent alignment work seems far below the level required to solve alignment.

OpenAI calls the article “Our Approach to Alignment Research”, not “Our Plan for Existential Safety from AGI”. This does not make it invalid to criticize the article as a plan; rather, it shows that OpenAI chose the wrong subject to write about.

A better plan, one satisfying more of these desiderata, probably exists within OpenAI for the development of GPT-4.

The capability work done by OpenAI is actively harmful, as it shortens the time we have to come up with a better plan.

I would also question the degree to which OpenAI management is committed to following this plan. The internal power structure of OpenAI is opaque, but worrying signs include the departure of key alignment researchers and statements from the CEO.

I envision a scenario where the Alignment Team finds evidence of one or more of the following:

  • Evaluating adversarially generated research is too hard

  • Building AI assistants in practice does not lead to insights about recognizing deceptive plans

  • The Alignment Team is unable to build an AI that can productively work on alignment research without the AI being potentially dangerous

  • There is a fundamental problem with RLHF (e.g., it only learns hijackable sense-data, without reference to reality)

In this scenario, I doubt the CEO would oblige if the Alignment Team requested that OpenAI stop capability work.

Alignment-Washing

I personally hold a weak belief that the purpose of the plan is not to end the period of acute risk from unaligned AI:

I strongly object to the expansive definition of “Alignment” being used in the document. If InstructGPT fails to follow simple instructions, this is a lack of capability, not a lack of alignment. If it keeps producing unwanted toxic and biased answers, this is not misalignment either. The goal of alignment is that the AI does not kill everyone, and this focus should not be diluted.

Microsoft is a major partner of OpenAI and is well known for its “Embrace, Extend, Extinguish” business strategy. I worry that a similar strategy may be at work at OpenAI, where “Alignment” is extended to cover a large number of other factors that are irrelevant to existential safety but where OpenAI has a competitive advantage.

Stated very bluntly, the alignment work done at OpenAI may be a fig leaf, similar to greenwashing; it could be called “Alignment-washing”.

This post is a summary of my presentation in the AISafety.com Reading Group session 264.