But it’s simple and robust compared to some sort of in-browser extension which saves the rendered DOM in the background.
It’s not robust for saving things like private chats, or anything else you need to be logged in to see. If I read your page correctly you’d need to do each of those manually. That is, unless the program can automatically take the cookies from a browser; and even then, not everything gets saved. I’d want something that saved every element that was downloaded to my computer.
Also, if the page gets deleted fast, your program may miss it.
I’ve never needed to prove that. My concern is usually having a copy at all, and the IA is trusted enough that it’s a de facto proof.
I anticipate needing that, or at least finding it useful. Did you know the Internet Archive will delete based on a request by the website? I’m dealing with a specific domain where people are forging screenshots to prove their side, and something like this would come in handy, I think. Some of the sites are also deleting posts fast, which makes it hard to archive on a schedule.
find a download tool which will save the raw bit-level TCP/IP stream when you download a page, writing it to an appropriate format like a PCAP file (maybe Wireshark supports this, or it could be hooked into something like wget?); this preserves the HTTPS encryption and allows proving that the content was served under the domain’s key (not that this means much), since the crypto can be checked for validity and the stream replayed.
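Half of this is just the container: a sketch of writing a minimal valid PCAP file in Python (the capture itself needs root and libpcap, which tcpdump or Wireshark would normally handle). Field layout follows the classic libpcap format; linktype 101 means “raw IP”; the packet bytes below are fake.

```python
# Sketch of the file-format half only: a minimal valid PCAP file.
# (Actually capturing packets needs root/libpcap; tcpdump or Wireshark
# would normally do both halves.) Layout per the classic libpcap format.
import os
import struct
import tempfile
import time

def write_pcap(path, packets, linktype=101):
    with open(path, "wb") as f:
        # Global header: magic, version 2.4, tz offset, sigfigs, snaplen, linktype.
        f.write(struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, linktype))
        for data in packets:
            ts = time.time()
            sec, usec = int(ts), int((ts % 1) * 1_000_000)
            # Record header: these timestamps are supplied by whoever writes
            # the file, which is exactly why they prove nothing by themselves.
            f.write(struct.pack("<IIII", sec, usec, len(data), len(data)))
            f.write(data)

path = os.path.join(tempfile.mkdtemp(), "capture.pcap")
fake_packet = b"\x45\x00\x00\x14" + b"\x00" * 16   # not a real IP datagram
write_pcap(path, [fake_packet])
print(open(path, "rb").read(4).hex())  # → d4c3b2a1 (the PCAP magic, byte-swapped)
```

Wireshark will open a file like this; the hard part, getting the real encrypted packets into it, is what tcpdump is for.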
Why doesn’t it mean much?
For timestamping, doesn’t the TCP protocol have timestamps in it, or are those not signed or something? Also, many pages have timestamps embedded in them.
We do have different uses for this, obviously.
Would a proxy be easier to set up, and if so, how would I do that to cache all results?
If there was a program that functioned like I wanted it to, would you prefer it over your solution?
It’s not robust for saving things like private chats, or anything else you need to be logged in to see. If I read your page correctly you’d need to do each of those manually.
There are always edge-cases. A simple version of my solution can be coded up and fully implemented in an hour or less by a normal programmer (the hardest part is writing the SQL line for pulling out URLs from Firefox); the full version (a bot or daemon) could probably be done in only a few hours more.
Your desired solution, on the other hand, requires intimate familiarity with browser extensions and internals (if you want to save dynamic content and fancy things like Javascript-based chats, so much for trying to leverage existing solutions like the Mozilla Archive Format extension!).
Pareto.
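For what it’s worth, the “hardest part” above is small enough to sketch. Table and column names follow Firefox’s actual places.sqlite history schema (moz_places, with last_visit_date in microseconds since the epoch); the demo database below is a mock with the same shape, since the real file is locked while Firefox runs and should be copied first.

```python
# Sketch: pulling recently-visited URLs out of Firefox's places.sqlite.
# moz_places/last_visit_date follow Firefox's real schema; the demo
# database below is a stand-in with made-up rows.
import os
import sqlite3
import tempfile

def recent_urls(db_path, since_us):
    """Return URLs with last_visit_date after `since_us` (microseconds since epoch)."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT url FROM moz_places WHERE last_visit_date > ?",
            (since_us,)).fetchall()
        return [url for (url,) in rows]
    finally:
        con.close()

# Build a throwaway database standing in for a copy of places.sqlite.
db = os.path.join(tempfile.mkdtemp(), "places.sqlite")
con = sqlite3.connect(db)
con.execute("CREATE TABLE moz_places (url TEXT, last_visit_date INTEGER)")
con.executemany("INSERT INTO moz_places VALUES (?, ?)",
                [("https://example.com/old", 1_000_000),
                 ("https://example.com/new", 2_000_000)])
con.commit()
con.close()

print(recent_urls(db, 1_500_000))  # → ['https://example.com/new']
```

The rest of the simple version is a loop feeding those URLs to wget or an archiving service.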
Did you know the Internet Archive will delete based on a request by the website?
My understanding is that in all cases, these deletions are really more ‘marking private’, and if it’s done via robots.txt, well, one day that robots.txt may be gone.
Some of the sites are also deleting posts fast, which makes it hard to archive on a schedule.
Note the on-demand archiving services used by my archive bot, discussed in my page...
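For concreteness, a sketch of what one such on-demand request looks like, assuming the Wayback Machine’s public “Save Page Now” endpoint at web.archive.org/save/&lt;url&gt; (error handling and rate-limiting omitted; the target URL and user-agent string are made up):

```python
# Sketch of on-demand archiving via the Wayback Machine's "Save Page
# Now" endpoint; a GET to web.archive.org/save/<url> asks it to take
# a snapshot immediately.
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_request(target_url):
    """Build the request asking the Wayback Machine to snapshot a URL."""
    return urllib.request.Request(SAVE_ENDPOINT + target_url,
                                  headers={"User-Agent": "archive-bot-sketch"})

req = save_request("https://example.com/fast-deleted-post")
print(req.full_url)  # → https://web.archive.org/save/https://example.com/fast-deleted-post
# To actually fire it: urllib.request.urlopen(req)
```

A bot watching for new posts and firing these requests closes most of the “deleted before the scheduled crawl” gap.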
For timestamping, doesn’t the TCP protocol have timestamps in it, or are those not signed or something?
I’m not sure. It’s possible that the packets have timestamps but the encrypted content does not, in which case you don’t get provable timestamping: the HTTPS encryption can be verified, but one could have modified the packets to read any timestamp one pleases, because they’re ‘around’ the encryption, not ‘in’ it. If the timestamps are inside the encryption, then maybe you don’t need explicit trusted timestamping; but if they aren’t (or you want to work with any other data sources which don’t luckily have timestamps built in just right), then the Bitcoin solution would work.
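The Bitcoin route needs nothing from the page itself: you hash the saved bytes, and it is that digest, not the file, which gets committed to the chain; anyone later holding the same bytes can recompute it and point at the transaction’s block time. A minimal sketch of the hashing primitive (the commit step itself is out of scope here):

```python
# Sketch of the timestamping primitive: hash the saved page; the digest
# is what would be embedded in a Bitcoin transaction. Identical bytes
# always yield the identical digest, so possession of the file later
# proves it existed when the digest was committed.
import hashlib

def archive_digest(data: bytes) -> str:
    """SHA-256 digest of an archived page, the value to be timestamped."""
    return hashlib.sha256(data).hexdigest()

snapshot = b"<html>...saved copy of the disputed post...</html>"
digest = archive_digest(snapshot)
print(digest)  # 64 hex characters, stable across recomputations
```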
Also, many pages have timestamps embedded in them.
Now who’s satisficing.
If there was a program that functioned like I wanted it to, would you prefer it over your solution?
I would consider it, but I would be somewhat reluctant to switch because I wouldn’t trust the tool to not break horribly at some point.