96 Bad Links in the Sequences
I’ve been scraping data from the sequences recently, where by sequences I mean all of Eliezer’s posts up to and including Practical Advice Backed By Deep Theories. I’ve been doing this mostly to get some fun data out and maybe some more useful things like the Bring Back the Sequences project, but one of the things I found is that there is breakage from the move from OB (and OB’s subsequent reorganization) that remains unfixed.
In particular, 96 links either return a 404 (not found), used to point to a specific comment but now reach only the main article, or land below the summary fold for no apparent reason. To avoid overloading this article, I have posted the list on PiratePad here:
http://piratepad.net/ep/pad/view/ro.eyxCVZYMZeO/latest
Note that I have only checked links that went to overcomingbias.com. This is not necessarily a complete list.
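For anyone who wants to reproduce the check, here is a minimal sketch of the link-gathering step. It is not my actual scraper: the class and function names are made up for illustration, and the URLs in the usage example are placeholders rather than real posts. It collects every link to overcomingbias.com from a page's HTML and notes whether the link carries a fragment, since the links with fragments are the ones that used to point at comments.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class OBLinkCollector(HTMLParser):
    """Collect href values of <a> tags that point at overcomingbias.com."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        host = (urlsplit(href).hostname or "").lower()
        if host == "overcomingbias.com" or host.endswith(".overcomingbias.com"):
            self.links.append(href)


def classify_fragment(url):
    """Label a URL by its fragment: none, the generic #comments, or a specific anchor."""
    fragment = urlsplit(url).fragment
    if not fragment:
        return "none"
    if fragment == "comments":
        return "comments-section"
    return "specific-anchor"


def find_ob_links(html):
    """Return (url, fragment_label) pairs for every OB link in the HTML."""
    collector = OBLinkCollector()
    collector.feed(html)
    return [(link, classify_fragment(link)) for link in collector.links]


if __name__ == "__main__":
    # Placeholder markup; the paths below are hypothetical.
    sample = (
        '<a href="http://www.overcomingbias.com/2008/01/example.html#comment-123">x</a>'
        '<a href="http://lesswrong.com/lw/xx/example/">y</a>'
    )
    for url, label in find_ob_links(sample):
        print(label, url)
```

Whether a collected link is actually broken still has to be checked by requesting it, which this sketch deliberately leaves out.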
Some of these can be fixed by anyone with editing rights, but the ones pointing to comments can be fixed only by Eliezer or someone who knows which comment was meant to be linked. Alternatively, someone can go through the archive.org Wayback Machine, figure out which comments were linked to, find them on the equivalent LessWrong page, and provide the corrected link. I may modify the scraper to do this if someone is willing to make the substitutions.
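The first step of the Wayback Machine route can be sketched as follows. This is a rough illustration under assumptions, not something the scraper does today: the helper names are hypothetical, the example URL is a placeholder, and the timestamp is an arbitrary guess at a date before the reorganization. It splits a broken link into the page URL and the comment anchor, then builds the archive URL for the page, so that a person (or a later scraper pass) can look for that anchor in the archived copy.

```python
from urllib.parse import urlsplit, urlunsplit

# Snapshot URLs on the Wayback Machine follow the pattern
# https://web.archive.org/web/<timestamp>/<original-url>
WAYBACK_PREFIX = "https://web.archive.org/web"


def split_anchor(url):
    """Return (page_url_without_fragment, fragment) for a link."""
    parts = urlsplit(url)
    page = urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))
    return page, parts.fragment


def wayback_url(page_url, timestamp="20090101000000"):
    """Build a Wayback Machine URL for a capture near `timestamp` (YYYYMMDDhhmmss).

    The default timestamp is an assumption chosen for illustration only.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Hypothetical broken link, for illustration only.
    page, anchor = split_anchor(
        "http://www.overcomingbias.com/2008/01/example.html#comment-123"
    )
    print(wayback_url(page), anchor)
```

Matching the archived comment to its counterpart on LessWrong (for instance by commenter name and opening words) is the harder part and is not covered here.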
Also, a bunch of links (not in the above list) direct the user to OvercomingBias.com only to be redirected back to LessWrong. While this doesn’t actually cause any breakage, it’s a pity to be burdening OB’s server for no real reason. I can produce a list of these if needed.
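A possible way to flag these round trips, again as a sketch with hypothetical names rather than the scraper's real code: follow the link, take the URL the request finally lands on, and check whether it started on OB and ended on LessWrong. The domain names are the only details taken from the situation described above; everything else is illustrative.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def host_matches(url, domain):
    """True if the URL's host is `domain` or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def is_round_trip(start_url, final_url):
    """True if a link starts on overcomingbias.com and ends up on lesswrong.com."""
    return host_matches(start_url, "overcomingbias.com") and host_matches(
        final_url, "lesswrong.com"
    )


def resolve_final_url(url, timeout=10):
    """Follow redirects and return the final URL (makes a real network request)."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()
```

A link for which `is_round_trip(url, resolve_final_url(url))` is true could then be rewritten to the final URL directly, which would spare OB's server the extra hop.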
If I have managed to attract the attention of anyone with editorial rights, I would really appreciate it if you could help me out by removing certain formatting inconsistencies that greatly slow down and complicate my scraper. I can offer more details on request, but these links to OB are near the top of the list.
I should be back with more interesting data soon. If you have any particular data-mineable queries about the sequences, let me know.
[Edit: The 4 links that point to a #comments fragment are actually processed correctly. That leaves 92 to be fixed.]