There are lots of lurkers on Less Wrong:
http://lesswrong.com/lw/1np/attention_lurkers_please_say_hi/
Another problem with your methodology is that prolific users typically have many pages associated with them that lack the text “submitted by” or “comments by”. You can reach these pages by going to the user’s main page, scrolling down, and clicking the little “Next” link in the lower left.
These additional pages aren’t being counted. From what I understand, Google doesn’t simply follow dynamically generated “Next” links like that. It spiders, moving around in a web-like pattern. How many times would it end up visiting the same pages if it followed every comment to its original discussion? A lot. That would be a waste of resources.
To test this, I looked at the URL that appears when you press the “Next” button. The site adds some pagination variables to the URL, and the word “count” appears among them. So you can run the following queries:
site:lesswrong.com/user -"submitted by" -"comments by" -count
site:lesswrong.com/user -"submitted by" -"comments by" -com (for comparison)
And observe:
A. Adding -count does not cut the number of results to a small fraction of the original, as you’d expect if the paginated pages were indexed. The original method gives 9,820 total users (at this moment); with -count it gives 9,460.
B. Excluding “com” (the second query) returns zero results. Since every result URL contains “com”, this shows that exclusion terms match against the URL, which verifies that adding -count would have removed the pages generated by those “Next” links, had they been included.
C. If you click through random pages of Google results, you won’t see the “count” and “after” variables in the URLs (or at least I didn’t, and I feel fairly confident that they won’t be there).
D. If Vladimir is correct in this post, then just looking at one of the lines where users’ comments are totaled (the line where 900 have 25 comments) shows that excluding “count” from the query should have removed at least 1,800 from the total. Nowhere near that many were lost, and far more than that should have been lost, because this example covers only a tiny fraction of the comments pages on this site.
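The arithmetic behind point D can be sketched roughly as follows. The 900 users, the 25 comments, and the two result counts come from the text above; the page size of 10 comments per user page is purely an assumption (the thread doesn't state it), chosen because it happens to reproduce the “at least 1800” figure. Whether that is how the number was actually derived is a guess.

```python
import math

# Assumption: each user page lists 10 comments. Not stated in the thread.
COMMENTS_PER_PAGE = 10


def extra_pages(comments, per_page=COMMENTS_PER_PAGE):
    """Number of paginated "Next" pages beyond the user's main page."""
    return max(math.ceil(comments / per_page) - 1, 0)


users = 900          # from the text: "900 have 25 comments"
comments_each = 25   # from the text

# Pages that -count should have removed if the paginated pages were indexed.
expected_loss = users * extra_pages(comments_each)

# Pages actually removed: original method minus the -count query.
observed_loss = 9820 - 9460

print(expected_loss)  # 1800
print(observed_loss)  # 360
```

Under that assumption, the observed drop is a fifth of what this one line alone predicts, which is the point being made: the paginated pages mostly weren't in the index to begin with.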
I’m pretty sure Google normally does follow dynamic links. In this case, though, it doesn’t, since they are marked nofollow.
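For anyone who wants to check this on a saved copy of a user page, here is a minimal sketch that lists the links carrying rel="nofollow". The sample markup is hypothetical, not copied from the site; only the “count” and “after” variable names come from the discussion above, and the path and parameter values are made up for illustration.

```python
from html.parser import HTMLParser


class NofollowLinkFinder(HTMLParser):
    """Collects the href of every <a> tag whose rel attribute includes nofollow."""

    def __init__(self):
        super().__init__()
        self.nofollow_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow_hrefs.append(attr_map.get("href"))


# Hypothetical markup standing in for a user page with a "Next" link.
SAMPLE_HTML = """
<a href="/user/example/">Overview</a>
<a rel="nofollow" href="/user/example/?count=10&amp;after=t1_abc">Next</a>
"""

finder = NofollowLinkFinder()
finder.feed(SAMPLE_HTML)
print(finder.nofollow_hrefs)
```

To try it on a real page, replace SAMPLE_HTML with the page source; any “Next” link that shows up in the output is one a crawler honoring nofollow would skip.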