I want comments on my social media crossposts to show up on my blog as a comment section, and mostly this works well: modern systems ( Mastodon, Bluesky, LessWrong, etc) provide APIs where you can load the replies associated with a post. On the other hand, older systems like Facebook are more locked down: they want to keep your content inside the platform as part of their economic moat.
Still, Facebook will show all the comments on a post to anyone who visits it, even if logged out. You have to dismiss a popup and click “show more” and “see replies” a few times, but it’s all public. At times I’ve written scripts to export the comments, but they’re quite brittle: Facebook doesn’t design their pages to be easy to scrape, and so my code has relied on incidental things that only happen to work.
Even though this is not a permanent solution, I’ve had another go at writing a comment exporter (code). It’s not as thorough as past times: I couldn’t figure out easy was to get the timestamp or links to the comment on Facebook, and I’ve left both out. I also had to switch my opt-out from working on user id to user name, which is less robust. But it works! I’ve gone back through June 2019, fetching comments for any posts where I was missing them.
Are you looking for comments/review, or just showing your work?
Sharing the code in case others are curious, but if you have suggestions on how to do it better I’d be curious to hear them!
I don’t know about better, but there are some ways to DRY up your code a bit with some higher-order functions and, eliminate nested
if
statements that I tend to have more trouble following. Your approach is pretty straightforward, and mostly my suggestions will be minor refactors at best.I’ll go through it tonight, and post it as a pull request, that way it will be easier to discuss any particular line.
One general suggestion I have is to write some test code that can notify you when something breaks. Since you are forced to rely on a brittle solution, knowing when it eventually breaks will helpful for the future.
I’m happy to look at a PR, but I think I’m unlikely to merge one that’s minor refactors: I’ve evaluated the current code through manual testing, and if I were going to make changes to it I’d need another round of manual testing to verify it still worked. Which isn’t that much work, but the benefit is also small.
It’s reasonably fast for me to evaluate it manually: pick a post that should have some comments (including nested ones) and verify that it does in fact gather them. Each time it runs it tells me the number of comments it found (via the title bar) and this is usually enough for me to tell if it is working.
I think this is an unusually poor fit for automated tests? I don’t need to keep the code functional while people other than the original author work on it, writing tests won’t keep dependencies from breaking it, the operation is simple enough for manual evaluation, the stakes are low, and it’s quite hard to make a realistic test environment.
I totally got carried away. 😅
[Here’s what I did](https://github.com/jeffkaufman/comments-selenium/pull/1). I don’t even know if it is a real suggestion anymore. Maybe you’ll find inspiration from some of it. Maybe not.
Interesting to read through! Thoughts:
I really don’t like the no-semicolons JS style. I’ve seen the arguments that it’s more elegant, but a combination of “it looks wrong” and “you can get very surprising bugs in cases where the insertion algorithm doesn’t quite match our intuitions” is too much.
What’s the advantage of making
alreadyClicked
a set instead of keeping it as a property of the things it’s clicking on?In this case I’m not at all worried about memory leaks, since the tab will only exist for a couple seconds.
The
getExpandableComments
simplification is nice!I haven’t tested it, but I think your
collectComments
has a bug in it where it will include replies as if they are top level comments in addition to including them as replies to the appropriate top level comments.