lxml is a bit of a mind-twister, and I only know as much as I need to, since the more advanced things require XPath. If you’re trying to get your head around how all this works, I suggest taking a look at my other two scrapers, which are considerably simpler:
http://scraperwiki.com/scrapers/commonsenseatheism_-_atheism_vs_theism_debates/
http://scraperwiki.com/scrapers/hp_fanfic_reviews/
As I learn more I take on more challenging tasks, which leads to more complex scrapers, but if you know Java and regex, Python should be a breeze. I don’t mind answering specific questions or talking on Skype if you want to go through the code live. Duplication of effort is a pet peeve of mine, and using ScraperWiki/Python/lxml has been like having a new superpower I’d love to share. Don’t hesitate to ask if you’re willing to invest some time.
Deal! I’ll read a bit about this and look into it more. I’m interested in this in that it seems like it’s somehow… well… “scraping” without digging through the actual HTML? Or is that not right? I have to do all kinds of dumb stuff to the raw HTML, whereas this seems like you’re able to just tell it, “Get td[0] and store it as =variable= for all the tr’s.”
It’s pretty slick. But… maybe the method itself is actually digging through the raw HTML and collecting stuff that way. Not sure.
Yeah, lxml processes all the HTML into a tree and gives you an API so you can access it as you like. It takes a lot of the grunt work out of extracting data from HTML.
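Roughly, it looks something like this. Just a sketch with a made-up table, not one of my actual scrapers:

import lxml.html

# A made-up table, purely to show the idea (not real scraped data)
html = """
<table>
  <tr><td>Alice</td><td>42</td></tr>
  <tr><td>Bob</td><td>17</td></tr>
</table>
"""

# lxml parses the whole document into a tree of elements...
root = lxml.html.fromstring(html)

# ...and then "get td[0] for all the tr's" really is just a loop:
for tr in root.findall(".//tr"):
    tds = tr.findall("td")
    name = tds[0].text_content()    # first cell
    value = tds[1].text_content()   # second cell
    print(name, value)

So you never touch the raw markup yourself; you just ask the tree for the elements you want.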
Which is awesome, as I just felt the pain of hand-pruning a heckuva lot of HTML tags out of something I wanted to transform into a different format. Even with my find-and-replacing, line breaks would prevent a tag from getting detected fully, and I had to do a lot of tedious stuff :)
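For what it’s worth, that tag-pruning chore is exactly the grunt work lxml takes over. A rough sketch, with a made-up snippet rather than your actual document, using text_content() and strip_tags():

import lxml.html
from lxml import etree

# A made-up, messily formatted snippet; line breaks inside a tag
# don't bother the parser the way they bother find-and-replace.
html = """<div>Some <b
>bold</b> text and a <a
href="http://example.com">link</a>.</div>"""

root = lxml.html.fromstring(html)

# text_content() returns all the text with every tag stripped out
print(root.text_content())

# Or strip only certain tags (keeping their text) and serialize back to HTML
etree.strip_tags(root, "b", "a")
print(lxml.html.tostring(root))

Either way, the parser deals with the markup so you never have to chase broken tags by hand.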