The Alignment Agenda THEY Don’t Want You to Know About

The title of this post is completely tongue-in-cheek. I have been advised to lean into the unpopularity of my opinions, so that’s where it came from.

In this post we lay out perhaps the most surprising prediction of the ethicophysics: any solution to the alignment problem will be wildly unpopular on LessWrong when it is initially posted. This should surprise you. LessWrong has mortgaged everything else it holds dear in order to prioritize solving the alignment problem, so why would it react poorly to someone actually doing so?

Our model has the following components:

  • The alignment community is currently perceived to have a relatively low chance of solving the alignment problem in the next week (seems uncontroversial).

  • New insights will be required (seems uncontroversial).

  • Those insights will probably come from a relative outsider who is “hungry” for recognition (unclear, but doesn’t seem super unlikely to me a priori).

  • The only people hungry for recognition are people who don’t already have it. Therefore, this relative outsider would have to have very low status in the alignment community relative to the value of the contributions they are about to make, if they are going to have the appropriate level of hunger. (Seems like a solid deduction to me?) Basically, we are talking about an Einstein-shaped person who is still in their patent clerk phase.

  • When this person goes to post the answer to the alignment problem to LessWrong, they will have low enough accumulated karma that the post will be poorly received. Basically, people will reason: if this guy were about to knock a baseball into outer space, wouldn’t we already know his name and have his rookie card? (Seems likely to me a priori, and like a perfectly reasonable cognitive shortcut for reasonable people to apply, especially given how short and incomplete any initial description of something as complex and weird as a solution to the alignment problem would have to be.)

  • By the Law of Conservation of Bullshit derived in Ethicophysics II, the potential bullshit (as measured by post karma) of the solution to the alignment problem cannot go up from where it starts without something seriously weird happening that requires strenuous effort on the part of multiple participants. (This relies on deeply understanding the content of Ethicophysics I and Ethicophysics II, but it’s a straightforward application of the results in those papers.)

  • Therefore, any solution to the alignment problem is likely to remain at negative karma until such time as it is accepted in consensus reality as an actual solution to the alignment problem.

  • Quod erat demonstrandum.