Paul Crowley comments on Print ready version of The Sequences

Paul Crowley 20 Nov 2010 16:45 UTC
9 points
I’ve now written a fairly sophisticated scraper for Eliezer’s blog posts based on lxml, which
- follows the Author links in “Article Navigation” to fetch all articles
- fetches and parses all articles
- identifies the title, body, and date
- fixes hrefs to internal references where possible, including references to Overcoming Bias that redirect back to Less Wrong
- fixes the weird Unicode characters wherever I can make a plausible guess at what was meant
- finds and adds the forward references in all blog posts
- caches all network operations in a very simple, dumb way
- writes everything out as very simple HTML with an equally simple HTML contents page, in a form that Calibre works well on.
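For illustration only (this is not the script described above), the title/body extraction and href-fixing steps might look roughly like the sketch below. The markup it assumes (title in the first `<h1>`, body in `<div class="content">`), the function name `parse_article`, and the two lookup tables `redirects` and `local_names` are all hypothetical; the real pages and the real script may differ.

```python
import lxml.html


def parse_article(html, base_url, redirects, local_names):
    """Return (title, body_html) for one article page.

    Assumptions (hypothetical, not taken from the real site):
    the title is the first <h1> and the body is <div class="content">.
    `redirects` maps old URLs to the URL they redirect to;
    `local_names` maps article URLs to the local file written for them.
    """
    doc = lxml.html.fromstring(html)
    # Resolve relative hrefs against the page's own URL first,
    # so every link can be looked up as an absolute URL.
    doc.make_links_absolute(base_url)

    def fix_link(href):
        # Follow a known redirect, then point at the local copy if there is one.
        target = redirects.get(href, href)
        return local_names.get(target, target)

    doc.rewrite_links(fix_link)

    title = doc.xpath("//h1")[0].text_content().strip()
    body = doc.xpath('//div[@class="content"]')[0]
    return title, lxml.html.tostring(body, encoding="unicode")
```

Links that appear in neither table are left as absolute URLs, which matches the "where possible" caveat: anything the scraper cannot map to a local file still points at the live site.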
I’ll share the script when I have time to sort out publishing via Mercurial, or email me if you’d like a snapshot copy: paul at ciphergoth dot org.
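Two of the smaller steps in the list, the dumb cache and the Unicode repair, can be sketched with the standard library alone. This is a guess at the general approach, not the script's actual code: `fetch_cached`, `fix_mojibake`, and the `cache` directory name are hypothetical, and the repair assumes the "weird" characters are UTF-8 text that was mis-decoded as Windows-1252, which the comment does not confirm.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def fetch_cached(url, cache_dir="cache", fetch=None):
    """Return the bytes at `url`, using a one-file-per-URL disk cache.

    `fetch` is an optional callable taking a URL and returning bytes;
    it defaults to a plain urllib download.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The file name is a hash of the URL, so any URL is a safe file name.
    path = cache_dir / hashlib.sha1(url.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()
    data = fetch(url) if fetch is not None else urlopen(url).read()
    path.write_bytes(data)
    return data


def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252.

    If the round trip fails, the text was probably fine already,
    so it is returned unchanged rather than guessed at.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

A cache like this never expires or revalidates anything, which is acceptable for a one-off scrape of posts that no longer change; deleting the directory forces a fresh download.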
- multifoliaterose 20 Nov 2010 17:10 UTC
  0 points
  Great to hear!