jwhendy comments on 96 Bad Links in the Sequences

jwhendy Apr 7, 2011, 5:38 PM
0 points

Don’t hesitate to ask if you’re willing to invest some time.

Deal! I’ll read some about this and look into it more. I’m interested because it seems like it’s somehow… well… “scraping” without digging through the actual html? Or is that not right? I have to do all kinds of dumb stuff to the raw html, whereas this seems like you’re able to just tell it, “Get td[0] and store it as =variable= for all the tr’s.”

It’s pretty slick. But… maybe the method itself is actually digging through the raw html and collecting stuff that way. Not sure.
- Alexandros Apr 8, 2011, 6:14 AM
  0 points
  Yeah, lxml processes all the html into a tree and gives you an API so you can access it as you like. It takes a lot of the grunt work out of extracting data from HTML.
  - jwhendy Apr 8, 2011, 1:23 PM
    0 points
    Which is awesome, as I just felt the pain of hand-pruning a heckuva lot of html tags out of something I wanted to transform to a different format. Even with find-and-replace, a line break inside a tag would keep it from being matched in full, so I had to do a lot of tedious cleanup by hand :)
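
For anyone following along, here is a rough sketch of the two things discussed in this thread: pulling td[0] out of every tr, and dropping tags from a snippet even when a tag is split across a line break. It assumes lxml is installed; the sample markup and the names `SAMPLE`, `SNIPPET`, `first_cells`, and `strip_tags` are made up for illustration and are not from the script discussed above.

```python
from lxml import html

# Hypothetical sample markup. The line break inside the first <td ...> tag
# is deliberate: it is the kind of thing that trips up find-and-replace.
SAMPLE = """
<table>
  <tr><td
      class="name">Alpha</td><td>1</td></tr>
  <tr><td class="name"><b>Beta</b> two</td><td>2</td></tr>
</table>
"""

# Hypothetical snippet with a link tag broken across two lines.
SNIPPET = '<p>Some <a\n   href="http://example.com">linked</a> text</p>'


def first_cells(markup):
    """Return the text of the first <td> in every <tr> of the markup."""
    tree = html.fromstring(markup)      # parse the html into an element tree
    cells = []
    for row in tree.iter("tr"):         # every <tr>, at any depth
        tds = row.findall("td")         # direct <td> children of this row
        if tds:
            # text_content() joins all nested text and ignores the tags
            cells.append(tds[0].text_content().strip())
    return cells


def strip_tags(markup):
    """Return only the text of the markup, with every tag dropped."""
    return html.fromstring(markup).text_content()


print(first_cells(SAMPLE))   # ['Alpha', 'Beta two']
print(strip_tags(SNIPPET))   # Some linked text
```

So yes, the library does dig through the raw html, but it does so once, in the parser, and line breaks inside tags stop mattering after that.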