I’ve now written a fairly sophisticated scraper for Eliezer’s blog posts based on lxml, which
follows the Author links in “Article Navigation” to fetch all articles
fetches and parses all articles
identifies the title, body, and date
fixes hrefs to internal references where possible, including references to Overcoming Bias that redirect back to Less Wrong
fixes the weird Unicode characters wherever I can make a plausible guess at what was intended
finds and adds the forward references in all blog posts
caches all network operations in a very simple, dumb way
writes them all out as very simple HTML with a very simple HTML contents page, in a form that Calibre works well on.
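
To make the caching and Unicode steps concrete, here is a minimal sketch of how they might look. It is not the actual script: the names `cached_fetch` and `fix_mojibake`, the `cache` directory, and the injectable `fetch` argument are all hypothetical, and it assumes the "weird" characters are UTF-8 text that was mis-decoded as Windows-1252, which is only one plausible cause. It uses the standard library only, so the lxml parsing is left out.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def cached_fetch(url, cache_dir="cache", fetch=None):
    """Return the bytes at `url`, hitting the network only on a cache miss."""
    if fetch is None:
        # Default fetcher; anything that maps a URL to bytes can be swapped in.
        fetch = lambda u: urlopen(u).read()
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # One file per URL, named by a hash so any URL becomes a safe filename.
    path = cache / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()
    data = fetch(url)
    path.write_bytes(data)
    return data


def fix_mojibake(text):
    """Undo UTF-8 mis-decoded as Windows-1252, but only if the round trip works."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # No plausible guess: leave the text alone.
        return text
```

The cache never expires and never checks for staleness, which is about as dumb as it gets, but it does mean the parsing code can be rerun without refetching anything. `fix_mojibake` works on a whole string at a time, so in practice it would probably be applied to small pieces of text rather than to a full page.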
I’ll share the script when I have time to sort out publishing via Mercurial. In the meantime, email me if you’d like a snapshot copy: paul at ciphergoth dot org.
Great to hear!