Quote
For any given note, most users have not rated that note, so most entries in the matrix will be zero, but that’s fine. The goal of the algorithm is to create a four-column model of users and notes, assigning each user two stats that we can call “friendliness” and “polarity”, and each note two stats that we can call “helpfulness” and “polarity”. The model is trying to predict the matrix as a function of these values, using the following formula:

rating ≈ μ + i_u + i_n + f_u × f_n

Note that here I am introducing both the terminology used in the Birdwatch paper, and my own terms to provide a less mathematical intuition for what the variables mean:
μ is a “general public mood” parameter that accounts for how high the ratings are that users give in general
i_u is a user’s “friendliness”: how likely that particular user is to give high ratings
i_n is a note’s “helpfulness”: how likely that particular note is to get rated highly. Ultimately, this is the variable we care about.
f_u or f_n is user u’s or note n’s “polarity”: its position along the dominant axis of political polarization. In practice, negative polarity roughly means “left-leaning” and positive polarity means “right-leaning”, but note that the axis of polarization is discovered emergently from analyzing users and notes; the concepts of leftism and rightism are in no way hard-coded.
The algorithm uses a pretty basic machine learning model (standard gradient descent) to find values for these variables that do the best possible job of predicting the matrix values. The helpfulness that a particular note is assigned is the note’s final score. If a note’s helpfulness is at least +0.4, the note gets shown.
The core clever idea here is that the “polarity” terms absorb the properties of a note that cause it to be liked by some users and not others, and the “helpfulness” term only measures the properties that a note has that caused it to be liked by all. Thus, selecting for helpfulness identifies notes that get cross-tribal approval, and selects against notes that get cheering from one tribe at the expense of disgust from the other tribe.
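To make the excerpt concrete, here is a toy sketch of that factorization in plain Python. The ratings, tribe structure, and hyperparameters are all invented for illustration, and the real Community Notes implementation differs in many details (regularization scheme, initialization, extra eligibility checks):

```python
# Toy sketch of the model from the excerpt: each observed rating r_un is
# approximated as mu + i_u + i_n + f_u * f_n, where the intercepts i_* are
# "friendliness"/"helpfulness" and the 1-D factors f_* are "polarity".
import random

def fit(ratings, n_users, n_notes, lr=0.05, steps=2000, reg=0.03):
    """ratings: list of (user, note, value) with value in {0, 1}."""
    random.seed(0)
    mu = 0.0
    i_u = [0.0] * n_users
    i_n = [0.0] * n_notes
    f_u = [random.uniform(-0.1, 0.1) for _ in range(n_users)]
    f_n = [random.uniform(-0.1, 0.1) for _ in range(n_notes)]
    for _ in range(steps):
        for u, n, r in ratings:
            err = (mu + i_u[u] + i_n[n] + f_u[u] * f_n[n]) - r
            mu -= lr * err
            i_u[u] -= lr * (err + reg * i_u[u])
            i_n[n] -= lr * (err + reg * i_n[n])
            f_u[u], f_n[n] = (f_u[u] - lr * (err * f_n[n] + reg * f_u[u]),
                              f_n[n] - lr * (err * f_u[u] + reg * f_n[n]))
    return mu, i_u, i_n, f_u, f_n

# Two "tribes" of raters. Note 0 gets high ratings from everyone;
# note 1 only from tribe A.
tribe_a, tribe_b = [0, 1, 2], [3, 4, 5]
ratings = [(u, 0, 1) for u in tribe_a + tribe_b]   # consensus note
ratings += [(u, 1, 1) for u in tribe_a]            # partisan note: A likes it
ratings += [(u, 1, 0) for u in tribe_b]            # ...B does not

mu, i_u, i_n, f_u, f_n = fit(ratings, n_users=6, n_notes=2)
# The polarity product absorbs the partisan split, so the consensus note
# ends up with the higher "helpfulness" intercept.
print(i_n[0] > i_n[1])
```

The point to notice is that the partisan note’s mixed reception gets explained by the f_u × f_n term rather than by its intercept, which is exactly the “polarity absorbs tribal appeal” argument in the last quoted paragraph.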
Very broadly speaking, alignment researchers seem to fall into five different clusters when it comes to thinking about AI risk:
MIRI cluster. Think that P(doom) is very high, based on intuitions about instrumental convergence, deceptive alignment, etc. Does work that’s very different from mainstream ML. Central members: Eliezer Yudkowsky, Nate Soares.
Structural risk cluster. Think that doom is more likely than not, but not for the same reasons as the MIRI cluster. Instead, this cluster focuses on systemic risks, multi-agent alignment, selective forces outside gradient descent, etc. Often work that’s fairly continuous with mainstream ML, but willing to be unusually speculative by the standards of the field. Central members: Dan Hendrycks, David Krueger, Andrew Critch.
Constellation cluster. More optimistic than either of the previous two clusters. Focuses more on risk from power-seeking AI than the structural risk cluster, but does work that is more speculative or conceptually-oriented than mainstream ML. Central members: Paul Christiano, Buck Shlegeris, Holden Karnofsky. (Named after Constellation coworking space.)
Prosaic cluster. Focuses on empirical ML work and the scaling hypothesis, is typically skeptical of theoretical or conceptual arguments. Short timelines in general. Central members: Dario Amodei, Jan Leike, Ilya Sutskever.
Mainstream cluster. Alignment researchers who are closest to mainstream ML. Focuses much less on backchaining from specific threat models and more on promoting robustly valuable research. Typically more concerned about misuse than misalignment, although worried about both. Central members: Scott Aaronson, David Bau.
Remember that any such division will be inherently very lossy, and please try not to overemphasize the differences between the groups, compared with the many things they agree on.
Depending on how you count alignment researchers, the relative sizes of these clusters will vary, but on a gut level I treat them all as roughly the same size.
Yeah. The threshold for “okay, you can submit to alignmentforum” is way, way, way too high, and as a result, lesswrong.com is the actual alignmentforum. Attempts to insist otherwise without appropriately intense structural change will be met with lesswrong.com going right on being the alignmentforum.
Ok, slightly off topic, but I just had a wacky notion for how to break up groupthink as a social phenomenon. You know Polis, the cool tool from Audrey Tang’s ideas? What if we did that, but found ‘thought groups’ of LessWrong users based on agreement voting, and then gave more weight to posts/comments that were popular across thought groups rather than just intensely within one?
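As a toy illustration of the thought-group idea (all usernames, votes, and the crude two-group heuristic here are invented; Polis itself, as I understand it, clusters participants with dimensionality reduction plus k-means, and a real version would need something similarly principled):

```python
# Toy sketch: find "thought groups" from an agreement-vote matrix,
# then surface items with net support in every group.
from itertools import combinations

# votes[user][item] = +1 (agree), -1 (disagree), 0 (no vote) -- invented data
votes = {
    "a1": [+1, +1, -1, +1], "a2": [+1, +1, -1, 0],
    "b1": [-1, -1, +1, +1], "b2": [-1, 0, +1, +1],
}

def agreement(u, v):
    """Fraction of co-voted items where u and v cast the same vote."""
    shared = [(x, y) for x, y in zip(votes[u], votes[v]) if x and y]
    if not shared:
        return 0.0
    return sum(x == y for x, y in shared) / len(shared)

# Greedy two-group split: seed with the least-agreeing pair, then
# assign every other user to whichever seed they agree with more.
seed1, seed2 = min(combinations(votes, 2), key=lambda p: agreement(*p))
groups = {seed1: [seed1], seed2: [seed2]}
for u in votes:
    if u not in (seed1, seed2):
        best = max((seed1, seed2), key=lambda s: agreement(u, s))
        groups[best].append(u)

def net_support(group, item):
    return sum(votes[u][item] for u in group)

# An item is "cross-group" if both thought groups support it on net.
cross_group_items = [
    i for i in range(4)
    if all(net_support(g, i) > 0 for g in groups.values())
]
print(cross_group_items)  # prints [3]: the only item both groups back
```

Here items 0–2 split along group lines, so only item 3 would get the cross-group boost.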
Niclas Kupper tried a LessWrong Polis to gather our opinions a while back. https://www.lesswrong.com/posts/fXxa35TgNpqruikwg/lesswrong-poll-on-agi
So, something like the community notes algorithm?
https://vitalik.eth.limo/general/2023/08/16/communitynotes.html
Ah, as a non-Twitter user I hadn’t known about this. Neat.
This is the formalization of the concept of “left-handed Whuffie” from Charlie Stross’s Down and Out in the Magic Kingdom (2003). When people who usually disagree with people like you actually agree with you or like what you’ve said, that’s special and deserves attention. I’ve always wanted to see it implemented. I don’t usually tweet, but I’ll have to look at this.
Down and Out in the Magic Kingdom was by Cory Doctorow, not Stross.
Good catch. I’d genuinely misremembered. I lump the two together, but generally far prefer Stross as a storyteller, even though Doctorow’s futurism is also first-rate, in a different dimension. I found the story in Down and Out to be Stross-quality.
That sort of good idea for a social network improvement is definitely signature Doctorow, though.
Another idea is to upweight posts if they’re made by a person in thought group A, but upvoted by people in thought group B.
Yeah, I’m interested in features in this space!
Another idea is to implement a similar algorithm to Twitter’s community votes: identify comments that have gotten upvotes by people who usually disagree with each other, and highlight those.
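A minimal sketch of that idea: flag a comment when its upvoters include a pair of users with low historical agreement. The agreement numbers and the threshold below are made up, and a real version would need actual vote-history data:

```python
# Flag comments upvoted by users who usually disagree with each other.
from itertools import combinations

# Fraction of past co-votes where the pair agreed (hypothetical numbers).
history = {
    ("alice", "bob"): 0.9,
    ("alice", "carol"): 0.2,
    ("bob", "carol"): 0.25,
}

def past_agreement(u, v):
    key = (u, v) if (u, v) in history else (v, u)
    return history.get(key, 0.5)  # unknown pairs default to neutral

def is_cross_cutting(upvoters, threshold=0.3):
    """True if some pair of upvoters historically agrees less than threshold."""
    return any(past_agreement(u, v) < threshold
               for u, v in combinations(upvoters, 2))

print(is_cross_cutting(["alice", "bob"]))    # prints False: they usually agree
print(is_cross_cutting(["alice", "carol"]))  # prints True: rivals both upvoted
```

This is the cheapest possible version; the community notes factorization quoted earlier is the more robust way to get the same signal at scale.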
This idea is definitely simmering in many people’s heads at the moment :)
How private are the LessWrong votes?
Would you want to do it overall or blog by blog? Seems pretty doable.
Currently, the information about who voted which way on what things is private to the individual who made the vote in question and the LW admins.
So if doing this on LW votes, it’d need to be done in cooperation with the LW team.
I’m pasting this here because it’s the sort of thing I’d like to see. I’d like to see where I fall in it, and at least the anonymized positions of others. Also, it’d be cool to track how I move over time. Movement over time should be expected, unless we fall into the ‘wrong sort of updateless decision theory’, as jokingly described by TurnTrout (the term was coined by Wei Dai). https://www.lesswrong.com/posts/j2W3zs7KTZXt2Wzah/how-do-you-feel-about-lesswrong-these-days-open-feedback?commentId=X7iBYqQzvEgsppcTb