LW Frontpage Experiments! (aka “Take the wheel, Shoggoth!”)
Update: June 20th
After a few rounds of adjustments and careful examination of the data, we’ve decided to make the Enriched tab the default for all logged-in users[1]. Anyone who has never switched tabs will be set to the Enriched tab. If you dislike it, you can switch back to the Latest tab, the prior default.
We’re not completely certain this is the correct choice long-term, but the results seem good enough to progress to a broader rollout for now, though we’ll keep monitoring.
We’ve also enabled a Recommended tab, so the available tabs are now:
Latest: 100% posts from the Latest algorithm (using karma and post age to sort)
Enriched (default): 50% posts from the Latest algorithm, 50% posts from the recommendations engine
Recommended: 100% posts from the recommendations engine, choosing posts specifically for you based on your history
Subscribed: a feed of posts and comments from users you have explicitly followed
Bookmarks: this tab appears if you have bookmarked any posts
Update: May 13th
If you’re reading this, it’s possible you just found yourself switched to the Enriched tab. Congratulations! You were randomly assigned to be fed to the Shoggoth, i.e. placed in a group of users automatically switched to the new posts list.
The Enriched posts list:
Is 50% posts from the same algorithm as Latest, and 50% ML-selected posts chosen for you based on your post interaction history.
The sparkle icon next to the post title marks which posts were the result of personalized recommendations.
You can switch back to the regular Latest tab at any time if you don’t like the recommendations.
We changed the name “Recommended” to “Enriched” to better imply that it contains 50% of the regular Latest posts. (We will probably soon add a Recommended tab that is 100% recommendations.)
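For the curious, the 50/50 mix could be produced by something as simple as interleaving the two ranked lists while skipping duplicates. This is only a sketch of the idea, not the site’s actual code, and the function and parameter names are hypothetical:

```python
def interleave(latest, recommended, limit=20):
    """Alternate posts from the Latest algorithm and the
    recommendations engine, skipping any post that already
    appeared, until the feed reaches `limit` entries."""
    mixed, seen = [], set()
    for latest_post, rec_post in zip(latest, recommended):
        for post in (latest_post, rec_post):
            if post not in seen:
                seen.add(post)
                mixed.append(post)
        if len(mixed) >= limit:
            break
    return mixed[:limit]

# Posts represented by ids; "p2" was both high-karma and recommended:
feed = interleave(["p1", "p2", "p3"], ["r1", "p2", "r2"], limit=4)
# → ['p1', 'r1', 'p2', 'p3']
```

Deduplication matters here because the recommender often surfaces posts that are also near the top of Latest.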
You can read further discussion of the experiments in this comment.
Original Post, April 22nd
For the last month, @RobertM and I have been exploring the possible use of recommender systems on LessWrong. Today we launched our first site-wide experiment in that direction.
(In the course of our efforts, we also hit upon a frontpage refactor that we reckon is pretty good: tabs instead of a clutter of different sections. For now, only for logged-in users. Logged-out users see the “Latest” tab, which is the same-as-usual list of posts.)
Why algorithmic recommendations?
A core value of LessWrong is to be timeless and not news-driven. However, the central algorithm by which attention allocation happens on the site is the Hacker News algorithm[2], which basically only shows you things that were posted recently, and creates a strong incentive for discussion to always be centered around the latest content.
This seems very sad to me. When a new user shows up on LessWrong, it seems extremely unlikely that the most important posts for them to read were all written within the last week or two.
I do really like the simplicity and predictability of the Hacker News algorithm. More karma means more visibility, older means less visibility. Very simple. When I vote, I basically know the full effect this has on what is shown to other users or to myself.
But I think the cost of that simplicity has become too high, especially as older content makes up a larger and larger fraction of the best content on the site, and people have been becoming ever more specialized in the research and articles they publish on the site.
So we are experimenting with changing things up. I don’t know whether these experiments will ultimately replace the Hacker News algorithm, but as the central attention allocation mechanism on the site, it definitely seems worth trying out and iterating on. We’ll be trying out a bunch of things from reinforcement-learning based personalized algorithms, to classical collaborative filtering algorithms to a bunch of handcrafted heuristics that we’ll iterate on ourselves.
The Concrete Experiment
Our first experiment uses Recombee, a recommendations SaaS, since spinning up our own RL agent pipeline would be a lot of work. We feed it user view and vote history. So far, it can be really good when it’s good, often recommending posts that people are definitely into (more so than posts in the existing feed). Unfortunately, it’s not reliable across users for some reason, and we’ve struggled to get it to reliably recommend the most important recent content, which is an important use-case we still want to serve.
Our current goal is to produce a recommendations feed that both makes people feel like they’re keeping up to date with what’s new (something many people care about) and also suggests great reads from across LessWrong’s entire archive.
The Recommendations tab we just launched has a feed using Recombee recommendations. We’re also getting started using Google’s Vertex AI offering. A very early test makes it seem possibly better than Recombee. We’ll see.
(Some people on the team want to try throwing relevant user history and available posts into an LLM and seeing what it recommends, though cost might be prohibitive for now.)
Unless you switch to the “Recommendations” tab, nothing changes for you. “Latest” is the default tab and is using the same old HN algorithm that you are used to. I’ll feel like we’ve succeeded when people switch to “Recommended” and tell us that they prefer it. At that point, we might make “Recommended” the default tab.
Preventing Bad Outcomes
I do think there are ways for recommendations to end up being pretty awful. I think many readers have encountered at least one content recommendation algorithm that isn’t giving them what they most endorse seeing, if not outright terrible uninteresting content.
I think it’s particularly dangerous to ship something where (1) your target metric really is only a loose proxy for value, (2) you’re detached from actual user experience, i.e. you don’t see their recommendations and can’t easily hear from them, (3) your incentives are fine with this.
I hope that we can avoid getting swallowed by Shoggoth for now by putting a lot of thought into our optimization targets, and perhaps more importantly by staying in contact with the recommendation quality via multiple avenues (our own recommendations as users of the site, user interviews and easy affordances for feedback, a broad range of analytics).
Further costs of recommendations (common knowledge, Schelling discussion, author incentives)
As above, personalized algorithms mean we lose the simplicity and interpretability of the site’s attention-allocation mechanism. It also means we no longer have common-ish knowledge of which posts everyone else has seen. There’s a sense of [research] community in feeling like we all read the same “newspaper” today, and that if some people are discussing a new post, I’ve probably at least read its title.
That’s value I think we lose a good deal of, though I think we should be able to find mechanisms to offset it at least a bit. (Curated posts, which will continue, are one way of creating common-ish knowledge around posts.)
Relatedly, with a shared frontpage focused on recent posts, discussion (commenting) gets focused around the same few posts. If people’s reading gets spread out over more posts, it could become harder for conversations to happen. Maybe that will be fine and is worth it for attention going to the overall best posts for people. I also think we might be able to find other mechanisms for coordinating discussion. I like the idea of trying a combined post/comment feed like Facebook/Twitter[3] that would show a user comments recently made by others, when they’re on a post or by another user that the reader is likely interested in[4]. Such a feed, if used by many, could allow discussion to spring up again on older posts too, which would be pretty cool.
I’ve had one team member comment that with personalized recommendations, they feel differently as an author because they don’t know when/where/for whom their post will show up, unlike with the current system. I think this is true, but it also doesn’t seem to stop people posting on Facebook or Twitter, so it’s likely not a dealbreaker. I do like the idea of providing analytics to authors showing how many people were shown a post, clicked on it, etc., possibly serving as an escape valve to catch if the algorithm is doing something dumb.
Thoughts?
Please share anything you do/don’t like about the recommendations or any of the new frontpage tabs we’ve shipped. Especially great would be screenshots of your posts list with your reactions to it – which posts are particularly great or terrible.
Also happy to get into thoughts about the general use of recommendations on LW in the comments here. Cheers.
- ^
This is mostly because enabling recommendations for logged-out users requires some more technical work.
- ^
Since the dawn of LessWrong 2.0, posts on the frontpage have been sorted according to the Hacker News algorithm:
Each post is assigned a score that’s a function of how much karma it has and how old it is, with posts discounted over time. In the last few years, we’ve enabled customization by allowing users to manually boost or penalize the karma of posts in this algorithm based on tags. The site has default tag modifiers to boost Rationality and World Modeling content (introduced when it seemed like AI content was going to eat everything).
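The general shape of such a time-decayed score can be sketched as follows. The constants and function names here are illustrative, not LessWrong’s actual parameters:

```python
def frontpage_score(karma, age_hours, tag_modifiers=(), gravity=1.15):
    """HN-style decayed score: more karma means more visibility,
    older means less. `tag_modifiers` are flat karma boosts or
    penalties (e.g. +25 for a boosted tag) applied before decay.
    The +2 offset and `gravity` exponent are illustrative values."""
    adjusted_karma = karma + sum(tag_modifiers)
    return adjusted_karma / (age_hours + 2) ** gravity

# A 10-hour-old post with 50 karma and a +25 tag boost:
score = frontpage_score(50, 10, tag_modifiers=(25,))
```

The key property is that the denominator grows with age, so a post needs ever more karma to stay at the same rank as it gets older.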
- ^
We have Recent Discussion, which is a pure chronological feed of posting and commenting activity, but I find it’s a bit too much of a firehose with lots of low-interest stuff, so I don’t look at it much.
- ^
Since trying out the “subscribe to user’s comments” feature that we shipped recently, I’ve found this to be an interesting way to discover posts to read. I’m motivated to read things people I like are discussing.
- ^
For now, the tabs are only visible to logged-in users, though the frontpage redesign has been rolled out to everyone. Logged-out users see the contents of the “Latest” tab (which is what the previous frontpage showed under the “Latest Posts” section).
I am sceptical of recommender systems; I think they are kind of bound to end up in self-reinforcing loops. I’d be happier seeing a more transparent system: we have tags, upvotes, the works, so you could have something like a series of “suggested searches” (e.g. the most common combinations of tags you’ve visited) that a user has fast access to, while also seeing precisely what it is that they’re clicking on.
That said, I do trust this website of all things to acknowledge if things aren’t going to plan and revert. If we fail to align this one small AI to our values, well, that’s a valuable lesson.
I’m generally not a fan of increasing the amount of illegible selection effects.
On the privacy side, can lesswrong guarantee that, if I never click on Recommended, then recombee will never see an (even anonymized) trace of what I browse on lesswrong?
Typo? Do you mean “click on Recommended”? I think the answer is no, in order to have recommendations for individuals (and everyone), they have browsing data.
1) LessWrong itself doesn’t aim for a super high degree of infosec. I don’t believe our data is sensitive enough to warrant a large security overhead.
2) I trust Recombee with our data about as much as I trust ourselves not to have a security breach. Though actually, I could imagine LessWrong being of more interest to some person or group and getting attacked.
It might help to understand what your specific privacy concerns are.
I would feel better about this if there was something closer to (1) on which to discuss what is probably the most important topic in history (AI alignment). But noted.
Over the years the idea of a closed forum for more sensitive discussion has been raised, but never seemed to quite make sense. Significant issues included:
- It seems really hard or impossible to make it secure from nation state attacks
- It seems that members would likely leak stuff (even if via their own devices not being adequately secure, or the like)
I’m thinking you can get some degree of inconvenience (and therefore delay), but it’s hard to have large shared infrastructure that’s that secure from attack.
I am sad to see you getting so downvoted. I am glad you are bringing this perspective up in the comments.
(Emphasis mine.)
Here’s an idea[1] for a straightforward(?) recommendation algorithm: Quantilize over all past LessWrong posts by using inflation-adjusted karma as a metric of quality.
The advantage is that this is dogfooding on some pretty robust theory. I think this isn’t super compute-intensive, since the only thing one has to do is compute the cumulative distribution function once a day (associating it with the posts), and then sample via inverse transform sampling from the CDF.
Recommending this way has the disadvantage of not being recency-favoring (which I personally like), and not personalized (which I also like).
By default, it also excludes posts below a certain karma threshold. That could be solved by exponentially tilting the distribution instead of cutting it off (θ>0, otherwise to be determined (experimentally?)). Such a recommendation algorithm wouldn’t be as robust against very strong optimizers, but since we have some idea what high-karma LessWrong posts look like (& we’re not dealing with a superintelligent adversary… yet), that shouldn’t be a problem.
If I was more virtuous, I’d write a pull request instead of a comment.
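A rough sketch of the exponentially tilted variant described above, where sampling weight grows with (inflation-adjusted) karma instead of hard-cutting below a threshold. Function and parameter names are hypothetical, and θ would need tuning as the comment says:

```python
import math
import random

def quantilizer_pick(posts, theta=0.01, k=1, rng=random):
    """Sample k posts with probability proportional to
    exp(theta * karma): a soft quantilizer over the archive.
    theta=0 is uniform over all posts; larger theta favors
    high-karma posts. Subtracting the max karma before exp()
    keeps the weights from overflowing."""
    max_karma = max(karma for _, karma in posts)
    weights = [math.exp(theta * (karma - max_karma)) for _, karma in posts]
    return rng.choices(posts, weights=weights, k=k)

# Illustrative archive of (title, inflation-adjusted karma) pairs:
archive = [("A", 500), ("B", 50), ("C", 5)]
picks = quantilizer_pick(archive, theta=0.1, k=5)
```

Since the weights only need the day’s karma snapshot, recomputing them once a day (as suggested) is cheap.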
Personalization is easy to achieve while keeping the algorithm transparent. Just rank your own viewed/commented posts by most frequent tags, then score past posts based on the tags and pick a quantile based on the mixed upvotes/tags score, possibly with a slider parameter that allows you to adjust which of the two things you want to matter most.
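A minimal sketch of that transparent scheme, assuming posts carry a tag list and a karma value (field and function names are hypothetical):

```python
from collections import Counter

def tag_affinity(viewed_posts):
    """How often each tag appears among the posts a user has viewed."""
    return Counter(tag for post in viewed_posts for tag in post["tags"])

def personalized_score(post, affinity, karma_weight=0.5):
    """Mix karma with tag affinity. `karma_weight` is the user-facing
    slider: 1.0 ranks purely by karma, 0.0 purely by tag match.
    (In practice both terms would want normalizing to similar scales.)"""
    tag_score = sum(affinity.get(tag, 0) for tag in post["tags"])
    return karma_weight * post["karma"] + (1 - karma_weight) * tag_score
```

Because the score is just two visible terms and a slider, a user can see exactly why a post ranked where it did.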
Disappointing to see this is the approach y’all are taking to making ai tools for the site, but I guess it does make sense that you’d want to outsource it. I’d strongly appreciate a way to opt out of having my data sent off-site for this or any future reason.
I am pretty excited about doing something more in-house, but it’s much easier to get data about how promising this direction is by using some third-party services that already have all the infrastructure.
If it turns out to be a core part of LW, it makes more sense to in-house it. It’s also really valuable to have a relatively validated baseline to compare things to.
There are a bunch of third-party services we couldn’t really replace that we send user data to. Hex.tech as our analytics dashboard service. Google Analytics for basic user behavior and patterns. A bunch of AWS services. Implementing the functionality of all of that ourselves, or putting a bunch of effort into anonymizing the data is not impossible, but seems pretty hard, and Recombee seems about par for the degree to which I trust them to not do anything with that data themselves.
I’d like to opt out of all analytics. I believe the GDPR requires you to implement this?
GDPR is a giant mess, so it’s pretty unclear what it requires us to implement. My current understanding is that it just requires us to tell you that we are collecting analytics data if you are from the EU.
And the kind of stuff we are sending over to Recombee would be covered by it being data necessary to provide site functionality, not just analytics, so wouldn’t be covered by that (if you want to avoid data being sent to Google Analytics in-particular, you can do that by just blocking the GA script in uBlock origin or whatever other adblocker you use, which it should do by default).
drat, I was hoping that one would work. oh well. yes, I use ublock, as should everyone. Have you considered simply not having analytics at all :P I feel like it would be nice to do the thing that everyone ought to do anyway since you’re in charge. If I was running a website I’d simply not use analytics.
back to the topic at hand, I think you should just make a vector embedding of all posts and show a HuMAP layout of it on the homepage. that would be fun and not require sending data anywhere. you could show the topic islands and stuff.
My bet is if you were running a website like this you’d see how useful analytics are for making complex websites better.
I have been employed making websites like this many times before. The analytics were extremely useful. It would have been much harder without it. Also, I have come to the opinion that one should not use analytics despite this.
I’d like a pure-recommender view, so that I can easily tell when I am looking at recommender vs frontpage posts. I would use that more than a mixed view.
That’s the plan; the only reason we didn’t deploy it on Thursday was that we have to do a small bit of extra work to extend caching (to achieve acceptable performance) to the pure-recommender view. We’ll probably have it up soon.
Oh, yeah, admins currently have access to a purely recommended view, and I prefer it. I would be in favor of making that accessible to users (maybe behind a beta flag, or maybe not, depending on uptake).
See comment.
We’ve had the choice of tabs up for a month now, and the results so far are encouraging, or at least not discouraging. Many users are very pleased with the recommendations, liking, among other things, that they bring to attention posts that otherwise get lost if you only see what’s new. Clickthrough rates are higher for people using the Enriched/Recommendations tab, although this is almost certainly a selection effect on the kind of user who changes tabs at all. Switching some people over automatically is motivated by wanting to get a better signal here before doing something like changing the global default.
The current recommendations still need more work though. People are much less likely to click on recommendations of posts they’ve already clicked on, but it’s proving tricky to eliminate such recommendations entirely. Also, the algorithm overwhelmingly recommends posts from the last year, when we’d like to see it surfacing stuff from further back too. Still, Latest is overwhelmingly stuff from the last week, so it’s still an improvement over the counterfactual.
--
Since we started the project, we’ve settled on the “hybrid” list as likely optimal for the default list people look at. Many people want to “keep up with the latest” even if they’re also interested in good posts from all time, so any recommended list of posts that’s the default has to have a heavy latest component. We first tried making two calls to the recommendations API, one with heavy recency bias, but it was hard to get it consistent, so we switched to just splitting the list between the usual Latest algorithm and the new recommendations algorithm.
This has the advantage that it preserves some of the “common knowledge” aspect of the current algorithm, where you know which posts other people are seeing too, and an author knows that if they get upvoted, their post will be visible automatically and transparently to many people. As discussed elsethread on this post, we want to have a pure-recommendations tab as well and have been waiting on a bit of coding to make that happen.
--
People often fear goodharting on the wrong metric (like clicks) with recommendation algorithms. I think we do need to keep an eye on that: I want to build more analytics tools for detecting drift here, and to keep talking to people. Once we fix up more basic issues, like excluding read content and getting it to even recommend posts older than a year[1], we’ll put more attention on whether the trend is good.
One guess I have is that the algorithm is stuck for dumb “structural” reasons: it’s been given recent data which is overwhelmingly of people reading recent content, so when it queries “what’s good?”, recent content comes out on top even without that being explicitly trained into the system.
I realized I hadn’t given feedback on the actual results of the recommendation algorithm. Rating the recommendations I’ve gotten (from −10 to 10, 10 is best):
My experience using financial commitments to overcome akrasia: 3
An Introduction to AI Sandbagging: 3
Improving Dictionary Learning with Gated Sparse Autoencoders: 2
[April Fools’ Day] Introducing Open Asteroid Impact: −6
LLMs seem (relatively) safe: −3
The first future and the best future: −2
Examples of Highly Counterfactual Discoveries?: 5
“Why I Write” by George Orwell (1946): −3
My Clients, The Liars: −4
‘Empiricism!’ as Anti-Epistemology: −2
Toward a Broader Conception of Adverse Selection: 4
Ambitious Altruistic Software Engineering Efforts: Opportunities and Benefits: 6
I’d be interested in a comparison with the Latest tab.
Transformers Represent Belief State Geometry in their Residual Stream: 6
D&D.Sci: −5
Open Thread Spring 2024: 3
Introducing AI Lab Watch: −3
An explanation of evil in an organized world: −3
Mechanistically Eliciting Latent Behaviors in Language Models: 3
Shane Legg’s necessary properties for every AGI Safety plan: −1
LessWrong Community Weekend 2024, open for applications: −6
Ironing Out the Squiggles: 5
ACX Covid Origins Post convinced readers: −7
Why I’m doing PauseAI: −2
Manifund Q1 Retro: Learnings from impact certs: −1
Questions for labs: −3
Refusal in LLMs is mediated by a single direction: 5
Take SCIFs, it’s dangerous to go alone: 4
I’m enjoying having old posts recommended to me. I like the enriched tab.
Mindblowing moment: It has been a private pet peeve of mine that it was very unclear what policy I should follow for voting.
In practice, I vote mostly on vibes (and expect most people to), but given my own practices for browsing LW, I also considered alternative approaches.
- Voting in order to assign a specific score (weighted for inflation by time and author) to the post. Related uses: comparing karma of articles, finding desirable articles on a given topic.
- Voting in order to match an equivalent-value article. Related uses: same; perhaps effective as a community norm but more effortful.
- Voting up if the article is good, down if it’s bad (after memetic/community/bias considerations) (regardless of current karma). Related uses: karma as indicator of community opinion.
In the end, making my votes consistent via extensive calculation turned out to be too much effort in every case, which is why I came back to vibes, amended by implicit considerations of consistent ways to vote.
I was trying to figure out ways to vote which would put me in a class of voters that marginally improved my personal browsing experience.
It never occurred to me to model the impact it would have on others and to optimize for their experience.
This sounds like an obviously better way to vote.
So for anyone who was in the same case as me, please optimize for others’ browsing experience (or your own) directly rather than overcalculate decision-theoretic whatevers.
Suggestion: A marker for recommended posts which are over x duration old. I was just reading this post which was recommended to me, and got half-way through before seeing it was 2 years out of date :(
https://www.lesswrong.com/posts/3S4nyoNEEuvNsbXt8/common-misconceptions-about-openai
(Or maybe it’s unnecessary and I’ll get used to checking post dates on the algorithmic frontpage)
We are experimenting with bolding the date on posts that are new and leaving it thinner on posts that are old, though feedback so far hasn’t been super great.