Posting my idea from IRC here too. We should look for ways to make the claims of this post more concrete and testable. I propose crawling the site to create a LW citation index. We can then make measurements: which new posts are picked up by the LW community? Does everyone always refer back to EY, or do we talk about the new stuff? Etc.
I propose crawling the site to create a LW citation index.
This. From the citation matrix it would be easy to compute an internal LessWrong “PageRank” and find the most influential pages. The only problem is that once this methodology is used and known, people will start behaving differently.
There are some technical details about the exact choice of the model. The simplest version would analyze only articles: each article has the same initial value (karma is ignored), and only links from article to article are considered (links from comments are ignored). Multiple links from page A to page B are treated as a single link. This is the easiest to do.
If we want to include karma, the simplest way would be to treat articles with zero or negative karma as non-existent (remove them from the model). I am not completely certain how to treat higher karma. If I understand it correctly, Google PageRank simulates a random user who with probability 85% clicks a random link on the page, and with probability 15% chooses a new starting page from a uniform distribution. We could replace the uniform distribution with a weighted distribution where positive karma is the weight of the page. I am not sure how sensitive the results would be to the “15%” value.
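As a minimal sketch (in Python, with made-up names), the karma-weighted restart distribution would just normalize the positive karma scores:

```python
def restart_distribution(karma):
    # Replace PageRank's uniform "teleport" distribution with one weighted
    # by karma: P(restart at A) = A.karma / sum of all positive karma.
    total = sum(karma.values())
    return {article: k / total for article, k in karma.items()}

restart_distribution({"A": 20, "B": 5})  # {"A": 0.8, "B": 0.2}
```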
If we want to include comments in the model, considering their karma is IMHO inevitable; otherwise it would be too easy to game the system (by writing new comments on highly ranked pages). But the comments are not new nodes in the graph (that wouldn’t work, because nobody links to an average comment, and it would be wrong to treat a comment under an article as if the article linked to it), so perhaps they could be treated as part of the article. A link from a comment would count like a link from the article, only weaker. How much weaker is determined by the comment’s karma relative to the article’s karma. For example, a link in a 5-karma comment below a 20-karma article would be treated as a 0.25 link. (If the same link appears in multiple comments, only the best weight is taken. If the comment has higher karma than the article, the link strength is capped at 1.0.)
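The comment-link weight described above is a one-liner (the function name is mine); the 5-karma comment under a 20-karma article gives 0.25:

```python
def comment_link_weight(comment_karma, article_karma):
    # A link in a comment counts as a fraction of a full article link,
    # capped at 1.0 when the comment outscores its article.
    return min(1.0, comment_karma / article_karma)

comment_link_weight(5, 20)   # 0.25
```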
Here is the pseudocode:
ARTICLES = articles with karma > 0
TOTALKARMA = sum(A.karma) for each article A in ARTICLES
for each article A in ARTICLES:
    for each hyperlink H in A:
        LINKS(A, H.target) = 1.0
    for each comment C with karma > 0 in A:
        for each hyperlink H in C:
            CLINK = min(1.0, C.karma / A.karma)
            LINKS(A, H.target) = max(LINKS(A, H.target), CLINK)
    TOTALLINKS(A) = sum(LINKS(A, A2)) for each article A2 in ARTICLES
for each article A1, A2 in ARTICLES:
    RANKFLOW(A1, A2) = LINKS(A1, A2) / TOTALLINKS(A1)  # where 0.0 / 0.0 = 0.0
for each A in ARTICLES:
    RANK(A) = A.karma / TOTALKARMA
repeat many times:
    for each article A in ARTICLES:
        NEWRANK(A) = 0.15 * A.karma / TOTALKARMA
        NEWRANK(A) += 0.85 * sum(RANK(A2) * RANKFLOW(A2, A)) for each article A2 in ARTICLES
    RANK = NEWRANK
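And a runnable Python sketch of the whole thing, under the assumptions above. The input format (dicts of karma scores and link lists) is made up for illustration:

```python
def karma_pagerank(article_karma, article_links, comment_links,
                   damping=0.85, iterations=50):
    """article_karma: {article: karma > 0}
    article_links:   {article: [target, ...]}  (links in the article body)
    comment_links:   {article: [(comment_karma, target), ...]}
    Returns {article: rank}."""
    articles = set(article_karma)
    total_karma = sum(article_karma.values())

    # Link weights; multiple links A -> B collapse to the strongest one.
    links = {}  # (source, target) -> weight in (0.0, 1.0]
    for a in articles:
        for target in article_links.get(a, []):
            if target in articles:
                links[(a, target)] = 1.0
        for c_karma, target in comment_links.get(a, []):
            if c_karma > 0 and target in articles:
                weight = min(1.0, c_karma / article_karma[a])
                links[(a, target)] = max(links.get((a, target), 0.0), weight)

    # Total outgoing link weight per article (TOTALLINKS above).
    out_weight = {a: 0.0 for a in articles}
    for (src, _), w in links.items():
        out_weight[src] += w

    # Start from the karma-weighted restart distribution, then iterate.
    rank = {a: article_karma[a] / total_karma for a in articles}
    for _ in range(iterations):
        new_rank = {a: (1 - damping) * article_karma[a] / total_karma
                    for a in articles}
        for (src, tgt), w in links.items():
            new_rank[tgt] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank
```

One caveat inherited from the pseudocode’s “0.0 / 0.0 = 0.0” convention: articles with no outgoing links simply leak their rank, so the ranks will not sum exactly to 1.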
Has this gone anywhere?
As far as I know, no.