The Alignment Agenda THEY Don’t Want You to Know About
The title of this post is completely tongue-in-cheek. I have been advised to lean into the unpopularity of my opinions, so that’s where it came from.
In this post we lay out perhaps the most surprising prediction of the ethicophysics, which is that any solution to the alignment problem will be wildly unpopular on LessWrong when it is initially posted. This should surprise you—LessWrong has mortgaged everything else it holds dear in order to prioritize solving the alignment problem—why would it react poorly to someone actually doing so?
Our model has the following components:
The alignment community is currently perceived to be at relatively low risk of solving the alignment problem in the next week (seems uncontroversial).
New insights will be required (seems uncontroversial).
Those insights will probably come from a relative outsider who is “hungry” for recognition (unclear, but doesn’t seem super unlikely to me a priori).
The only people hungry for recognition are people who don’t already have it. Therefore, this relative outsider would have to have very low status in the alignment community relative to the value of the contributions they are about to make, if they are going to have the appropriate level of hunger. (Seems like a solid deduction to me?) Basically, we are talking about an Einstein-shaped person who is still in their patent clerk phase.
When this person goes to post the answer to the alignment problem to LessWrong, they will have low enough accumulated karma that the post will be poorly received. Basically, people will reason, if this guy was about to knock a baseball into outer space, wouldn’t we already know his name and have his rookie card? (Seems likely to me a priori and like a perfectly reasonable cognitive shortcut for reasonable people to apply, especially in the face of any short and incomplete description of something as complex and weird as a solution to the alignment problem would have to be.)
By the Law of Conservation of Bullshit derived in Ethicophysics II, the potential bullshit (as measured by post karma) of the solution to the alignment problem cannot go up from where it starts without something seriously weird happening that requires strenuous effort on the part of multiple participants. (This relies on deeply understanding the content of Ethicophysics I and Ethicophysics II, but it’s a straightforward application of the results in those papers.)
Therefore, any solution to the alignment problem is likely to remain at negative karma until such time as it is accepted in consensus reality as being an actual solution to the alignment problem.
Quod erat demonstrandum
I don’t think this is accurate; it depends more on how it’s presented.
In my experience, if someone posts something that runs against the general LW consensus but argues carefully and in detail, addressing the likely conflicts and recognizing where and why their position differs from the consensus (in short, if they do the hard work of presenting it properly), it’s well received. It may earn an agreement downvote, which is natural and expected, but it also earns a karma upvote for the effort put into laying out the point, and those who disagree engage with the author to explain their points of disagreement.
Your point would be valid on most online forums, where people who aren’t as careful about arguments as LWers tend to conflate disliking with disagreeing, so that “a downvote is a downvote is a downvote.” Most LWers, in contrast, tend to be skilled at treating the two axes as orthogonal, and it shows.
That’s pretty fair, and an argument for me to be less trollish in my presentation. I have strong-agreed with you.
What the heck is “ethicophysics”?
A novel theory of the interactions between the physical components of a cyberphysical system and the cybernetic components of a cyberphysical system. Please see the sequence I have published for more details.
All I can find is this post, which links to a Substack post, which links to an Academia.edu page, which links to a PDF… but doesn’t let me view or download the PDF unless I log in.
Do you have an explanation of “ethicophysics” available somewhere… more accessible?
Look at the most recent post on my substack, which links to this github repo: https://github.com/epurdy/ethicophysics
The PDF is shown in full for me when I scroll down the academia.edu page. Here’s an archive.is capture in case this is some sort of intermittent A/B testing thing.
Some thoughts since you ask for feedback elsewhere:
1) Part of the reason why this post is likely being downvoted is the clickbait title. This is not looked upon favorably on Less Wrong.
2) You make some pretty good points in this post, but you state them far too confidently, almost like a mathematical proof. If you want your posts to do well on Less Wrong, try not to make strong claims without correspondingly strong evidence.
This is the crux for me. I think a person capable of making a significant contribution to alignment would probably also be capable of making some sort of smaller but more legible/uncontroversial contribution to show their competence. To go with the Einstein example, he got a PhD in physics before producing his groundbreaking results, and then was able to present those results in a format that was accepted by the physics establishment.
I have a PhD in Computer Science (2013, University of Chicago). My dissertation was entitled “Grammatical Methods in Computer Vision”. My master’s thesis was in complexity theory and was entitled “Locally Expanding Hypergraphs and the Unique Games Conjecture”. I also have one publication in ACM Transactions on Computation Theory on proving lower bounds in a toy model of computation.
I am an Engineering Fellow at [redacted] AI. My company went to Series A while I was leading its machine learning team. (I have since transitioned to being an individual contributor, because management sucks and is boring and I’m no good at it.) My company has twice received the most prestigious award handed out in its industry. I hold multiple patents related to my contributions at [redacted] AI.
I hold a patent for my work at Vicarious, where I was a senior researcher.
At one point, I quit my job and started a generative AI startup dedicated to providing psychotherapy. This model is online, and I can share a link to it in a DM if you are interested.
The state-sponsored German physics establishment famously sneered at Einstein’s work. The Nazi regime derided it as degenerate, “Jewish” physics. Sure, everyone who we actually respect now could recognize the value of his work after he started predicting novel astronomical phenomena. But it’s not like he ever could have gotten a job at a German university while the Nazis were in charge.
Maybe the problem is with my poor writing and sloppy craftsmanship, but maybe it is also partially a matter of LessWrong expecting the solution to the alignment problem to come with a lot less emotionally charged language and politically charged content than it logically would have to come with?
Interesting, this is more competence-requiring stuff than I expected.
Does the author having lower karma actually cause posts to be received more poorly? The author’s karma isn’t visible anywhere on the post, or even in the hover-tooltip by the author’s name. (One has to click through to the profile to find out.) Even if readers did know the author’s karma, would that really cause people to not just judge it by its content? I would be surprised.
Well, I’m more talking about the actual reputation one has in the alignment community, since that’s the thing that’s actually relevant to how a post is received. I have no idea what my reputation is like, but it would almost have to be “total unknown”.
Since predictions about the past are cheap, let’s make some ethicophysical predictions about the future.
My work will remain unpopular until it is endorsed by someone who is high status (say, a comparable amount of karma to John Wentworth), and will not truly enter the LessWrong canon until Eliezer Yudkowsky seriously engages with it.
These predictions are pretty obviously true, so I don’t claim many Brier points for making them. But I also can’t claim many Brier points for many points in the ethicophysics, since it’s really just a formalization of common sense applied to the moral domain.
I think it’s more likely you just need to explain yourself better. Try making a single post that is not a linkpost, does not ask the reader to read anything else, is less than 5k words, and explains your ideas end to end without claiming they are correct. Simply describe what the proposal is, without asserting results you don’t have. If you have more than one author, you can use the multiple-author “we”, but use “I” otherwise. In other words, stop propping your ideas up with pompous writing and just explain yourself already.
Here is the best I could muster on short notice: https://bittertruths.substack.com/p/ethicophysics-for-skeptics
Since I’m currently rate-limited, I cannot post it officially.