Unrelated to my other comment, I’d be interested in what you’re doing to scrape. In the cases where I’ve wanted LW articles, I’ve been using wget and then a bash script to change all of the obvious HTML stuff into a different markup format...
If you have already been doing something like this, I would be interested in how you’re parsing (post-processing) your “scrapes.” Or perhaps you’re not, and are just using curl or wget or something similar to iterate through everything?
I’ve got a scraper up on ScraperWiki (which means it gets rerun automatically on a periodic basis). Check here. You can see the Python source in the edit tab. It ain’t pretty, but it works. The post-processing is mostly with lxml. You can also download the data as CSV directly from the linked page, and you can run arbitrary SQL queries on the data from here. Not sure if that covers it; happy to answer any questions.
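In case it helps to see the shape of it: a ScraperWiki scraper is basically fetch, parse, save. Here’s a minimal sketch of that pattern — not my actual scraper code; the URL, the XPath expressions, and the field names are all made-up placeholders:

```python
import scraperwiki
import lxml.html

# Fetch the raw HTML and parse it into an element tree.
# The URL and the XPath expressions below are illustrative
# placeholders, not the real site's layout.
html = scraperwiki.scrape('http://lesswrong.com/top/')
root = lxml.html.fromstring(html)

for post in root.xpath('//div[@class="post"]'):
    links = post.xpath('.//a[@class="title"]')
    if not links:
        continue  # skip anything that doesn't match the assumed layout
    record = {
        'title': links[0].text_content().strip(),
        'url': links[0].get('href'),
    }
    # Save into ScraperWiki's datastore; rows with the same unique
    # key are updated in place on each periodic rerun, not duplicated.
    scraperwiki.sqlite.save(unique_keys=['url'], data=record)
```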
Well, you’re more advanced than me! I really, really, really want to learn python. Seeing it used is just more inspiration… but oh so much else to do as well.
I just hobble along with bash tools and emacs regexp hunting.
I know some Java, so I can somewhat follow most code, but the methods available in lxml are unknown to me—it looks like they give you quite a lot of power when trying to digest something like this? For me, it’s been using wget and then trying to figure out what tags I need to sed out of things… and LW seems a bit inconsistent: sometimes it’s one tag and sometimes it’s another, or whatever it is.
Very interesting work!
lxml is a bit of a mind-twister, and I only know as much as I need to, since the more advanced things require XPath. If you’re trying to get your head around how all this works, I suggest taking a look at my other two scrapers, which are considerably simpler:
http://scraperwiki.com/scrapers/commonsenseatheism_-_atheism_vs_theism_debates/
http://scraperwiki.com/scrapers/hp_fanfic_reviews/
As I learn more I take on more challenging tasks, which leads to more complex scrapers, but if you know Java and regex, Python should be a breeze. I don’t mind answering specific questions or talking on Skype if you want to go through the code live. Duplication of effort is a pet peeve of mine, and using ScraperWiki/Python/lxml has been like having a new superpower I’d love to share. Don’t hesitate to ask if you’re willing to invest some time.
Deal! I’ll read some about this and look into it more. I’m interested in this in that it seems like it’s somehow… well… “scraping” without digging through the actual HTML? Or is that not right? I have to do all kinds of dumb stuff to the raw HTML, whereas this seems like you’re able to just tell it, “Get td[0] and store it as =variable= for all the tr’s.”
It’s pretty slick. But… maybe the method itself is actually digging through the raw HTML and collecting stuff that way. Not sure.
Yeah, lxml processes all the HTML into a tree and gives you an API so you can access it as you like. It takes a lot of the grunt work out of extracting data from HTML.
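Concretely, your “get td[0] for every tr” idea looks something like this — a toy sketch with stand-in HTML rather than a real page:

```python
import lxml.html

# Stand-in HTML for a scraped page with a simple table.
html = """
<table>
  <tr><td>First post</td><td>2011-05-01</td></tr>
  <tr><td>Second post</td><td>2011-05-02</td></tr>
</table>
"""
root = lxml.html.fromstring(html)

# "Get td[0] and store it as a variable for all the tr's":
titles = [row.xpath('./td')[0].text_content() for row in root.xpath('//tr')]
print(titles)  # ['First post', 'Second post']
```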
Which is awesome, as I just felt the pain of hand-pruning a heckuva lot of HTML tags out of something I wanted to transform into a different format. Even with my find-replacing, line breaks would prevent a tag from getting detected fully, and I had to do a lot of tedious stuff :)
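Right—and that line-break problem is exactly what the parser route sidesteps. lxml works on the parsed tree, so it doesn’t matter where a tag happens to be split across lines. A quick illustration:

```python
import lxml.html

# A link tag split across lines—the case that defeats
# line-oriented find-and-replace.
html = '<p>Some text with a <a\nhref="http://example.com">multi-line\nlink</a> in it.</p>'

# text_content() drops all markup no matter how the lines break.
print(lxml.html.fromstring(html).text_content())
# -> Some text with a multi-line
#    link in it.
```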