Brotherzed comments on The Web Browser is Not Your Client (But You Don’t Need To Know That)

Brotherzed 25 Apr 2016 22:50 UTC
2 points

But consider the following problem: Find and display all comments by me that are children of this post, and only those comments, using only browser UI elements, i.e. not the LW-specific page widgets. You cannot—and I’d be pretty surprised if you could make a browser extension that could do it without resorting to the API, skipping the previous elements in the chain above. For that matter, if you can do it with the existing page widgets, I’d love to know how.

If you mean parse the document object model for your comments without using an external API, it would probably take me about a day, because I’m rusty with WatiN (the tool I used to used for web scraping when that was my job a couple years ago). About four hours of that would be setting up an environment. If I was up to speed, maybe a couple hours to work out the script. Not even close to hard compared to the crap I used to have to scrape. And I’m definitely not the best web scraper; I’m a non-amateur novice, basically. The basic process is this: anchor to a certain node type that is the child of another node with certain attributes and properties, and then search all the matching nodes for your user name, then extract the content of some child nodes of all the matched nodes that contain your post.

WatiN:: http://watin.org/

Selenium: http://www.seleniumhq.org/

These are the most popular tools in the Microsoft ecosystem.

As someone who has the ability to control how content is displayed to me (tip—hit f12 in google chrome), I disagree with the statement that a web browser is not a client. It is absolutely a client and if I were sufficiently motivated I could view this page in any number of ways. So can you. Easy examples you can do with no knowledge are to disable the CSS, disable JS, etc.
- Vaniver 28 Apr 2016 17:36 UTC
  0 points
  Parent
  Sure, and if I want a karma histogram of all of my posts I can scrape my user page and get them. But that requires moving a huge amount of data from the server to me to answer a fairly simple question, which we could have computed on the server and then moved to me more cheaply.
  - Brotherzed 10 May 2016 20:49 UTC
    0 points
    Parent
    There’s no extra load on the server; you’re just parsing what the page already had to send you. If your goal is just to see the web page and not data collection, it’s a different solution but also feasible.
    
    What you can do is create a simple browser plugin that injects jQuery into the page to get all the comments by a name. I’ll go into technical details a bit—Inject an extra version of jQuery into the page (that you know always uses the same code, in case lesswrong changes their version of jQuery). Then use JQuery selectors to anchor to all your posts using a technique similar to the one I described for the scraper. Then transform the page to consist of nothing but the anchored comments you acquired via Jquery.
    
    You could make this a real addon where you push a button in the top right of your chrome browser, type a username, and then you see nothing but all the posts by that user on a given page.
    
    Same principle as Adblock plus or other browser addons.
    - Vaniver 11 May 2016 12:38 UTC
      0 points
      Parent
      
      There’s no extra load on the server; you’re just parsing what the page already had to send you.
      
      If I look at 200 comments pages, doesn’t that require the server processing my request and sending me the comments page 200 times? Especially if telling it something like “give me 10 comments by user X after comment abc” means that it’s running a SQL query that compares the comment id to abc.
      
      I do agree that there are cool things you can do to manipulate comments on a page.
      - Brotherzed 23 May 2016 16:30 UTC
        0 points
        Parent
        
        If I look at 200 comments pages, doesn’t that require the server processing my request and sending me the comments page 200 times?
        
        As for finding your comments regardless of the thread they are on, that is already a feature of Reddit’s platform—click on your username, then click “comments” to get to the LW implementation of that feature.
        
        Regardless, that isn’t what you were describing earlier. It would not put extra load on the server to have jQuery transform this thread, which has all the comments, to show only your comments on the thread. It’s a client-side task. That’s what you originally said was not feasible.
        
        All this talk has actually made me consider writing an addon that makes slashdot look clean and in-line like LW, Reddit, Ycombinator, etc.
        Vaniver 23 May 2016 23:57 UTC
        0 points
        Parent
        
        That’s what you originally said was not feasible.
        
        Are you confusing me with Error? What I said was inefficient was writing a scraper to get my karma histogram, on every comment (well, I wrote post) that I’ve ever written.
        
        All this talk has actually made me consider writing an addon that makes slashdot look clean and in-line like LW, Reddit, Ycombinator, etc.
        
        I do think that’d be a cool tool to have (though I don’t use Slashdot).
- Error 28 Apr 2016 16:43 UTC
  0 points
  Parent
  Upvoted for actually considering how it could be done. It does sort of answer the letter if not the spirit of what I had in mind.
  - Brotherzed 10 May 2016 20:52 UTC
    0 points
    Parent
    I admit I didn’t think it all the way through. If your goal isn’t ultimately data collection, you would make a browser addon and use javascript injection (the frontend scripting language for rendering web pages). I replied to another person with loose technical details, but you could create a browser addon where you push a button in the top right corner of your browser, type a username, and then it transforms the page to show nothing but posts by the user of that page by leveraging the web page’s frontend scripting language.
    
    So there’s a user-friendly way to transform your browser’s rendering without APIs, clunky web scrapers or excess server load. It’s basically the same principle that adblockers work on.