gwern’s approach may be a good place to start.

I’ve seen that. He basically looks through his history with a script and then wgets it, as well as submitting to archive systems. That’s both wasteful on bandwidth as everything is downloaded twice, and anything not public needs to be done manually with cookies. He also can’t prove that they came from a site even if it used https.
I was thinking of something like a browser extension that just made sure nothing downloaded was ever deleted. I wonder if Chrome has a hook for when it internally deletes something, so that a program could instead copy it and convert it to some format?
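For concreteness, the history-scraping approach under discussion boils down to something like the sketch below. This is only an illustration, not gwern’s actual script; the Firefox profile path, the one-day cutoff, and the output naming are placeholder assumptions.

```python
# Pull recently visited URLs out of Firefox's history database, re-download
# each one locally (the "wget" step -- hence the downloaded-twice complaint),
# and ask the Internet Archive to grab its own copy.
# Note: copy places.sqlite elsewhere first if Firefox is running, since
# Firefox keeps the database locked.
import hashlib
import sqlite3
import time
import urllib.request

PLACES = "/home/user/.mozilla/firefox/myprofile.default/places.sqlite"  # placeholder path
CUTOFF = int((time.time() - 86400) * 1_000_000)  # last 24h; Firefox stores microseconds

con = sqlite3.connect(PLACES)
urls = [row[0] for row in con.execute(
    "SELECT DISTINCT url FROM moz_places "
    "WHERE last_visit_date > ? AND url LIKE 'http%'", (CUTOFF,))]
con.close()

for url in urls:
    try:
        data = urllib.request.urlopen(url, timeout=30).read()
        name = hashlib.sha1(url.encode()).hexdigest()[:16]
        with open(name + ".html", "wb") as f:
            f.write(data)
        # on-demand archiving: ask the Wayback Machine to save its own copy
        urllib.request.urlopen("https://web.archive.org/save/" + url, timeout=60)
    except Exception as exc:
        print("failed:", url, exc)
```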
That’s both wasteful on bandwidth as everything is downloaded twice, and anything not public needs to be done manually with cookies.
But it’s dead-simple and robust compared to some sort of in-browser extension which saves the rendered DOM in the background.
He also can’t prove that they came from a site even if it used https.
I’ve never needed to prove that. My concern is usually having a copy at all, and the IA is trusted enough that it’s a de facto proof.
But it’s possible:
find a download tool which will capture the raw bit-level TCP/IP packet stream when you download a page and save it to an appropriate format like a PCAP file (maybe Wireshark supports this or it could be hooked into something like wget?); this preserves the HTTPS encryption and allows proving that the content came signed with the domain’s key (not that this means much), since the crypto can be checked to be valid and the stream replayed.
This doesn’t prove it came from the domain at a particular time, but you can get timestamping using a trusted timestamping service such as the Bitcoin blockchain: as soon as the PCAP file is closed and the webpage downloaded, take the hash and send a satoshi to the equivalent Bitcoin address. Transaction fees mean that this might get expensive to do for each web page, but there are a lot of ways you could optimize this (such as batching up an entire day’s worth of downloads into a single tarball archive, and timestamping that; big savings, but also a coarser timestamp; there are other schemes). Now you can prove the content came from someone with a particular HTTPS public key and that you downloaded it on or before a particular hour and date (roughly when the Bitcoin block with your transaction will be mined).
If anyone doubts you, they can take the relevant tarball, verify that its hash and timestamp match what you claim, then extract the relevant PCAP, verify the encryption, and then the displayed content.
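A rough sketch of the capture-and-hash step, assuming tcpdump is installed and the script is allowed to capture packets (root or CAP_NET_RAW); example.org and the filenames are placeholders, and the timestamping of the resulting digest is sketched further down.

```python
# Capture the raw packet stream for one HTTPS download and hash the capture.
# The digest is what would later be committed to a timestamping service.
import hashlib
import subprocess
import time
import urllib.request

HOST = "example.org"          # placeholder target
PCAP = "example.org.pcap"     # placeholder output file

# record only traffic to/from the target host on port 443
cap = subprocess.Popen(["tcpdump", "-w", PCAP, "host", HOST, "and", "port", "443"])
time.sleep(1)  # give tcpdump a moment to start listening

urllib.request.urlopen("https://" + HOST + "/", timeout=30).read()

time.sleep(1)
cap.terminate()  # tcpdump closes the capture file cleanly on SIGTERM
cap.wait()

digest = hashlib.sha256(open(PCAP, "rb").read()).hexdigest()
print(PCAP, digest)
```

As the next reply points out, the capture by itself proves very little; it is only the artifact whose digest gets timestamped.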
Good luck.

Simply sniffing https packets is worthless. At most it tells you the length of the session. It’s exactly the adversary that TLS is designed to foil. You need the session key to decrypt it, so you need to hook into the TLS implementation.
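For what it’s worth, the usual way to ‘hook into the TLS implementation’ purely for decryption is a TLS key log file. That gets you readable captures (Wireshark can combine the log with the PCAP), but no third-party proof, since anyone holding the session keys could just as easily have forged the traffic. A minimal sketch using Python’s ssl module, with the URL as a placeholder:

```python
# Record TLS session secrets (NSS key log format) while fetching a page, so a
# packet capture of the same session can later be decrypted in Wireshark via
# its "(Pre)-Master-Secret log filename" preference. This enables decryption
# by whoever holds the log file; it is not transferable proof of anything.
import ssl
import urllib.request

ctx = ssl.create_default_context()
ctx.keylog_filename = "tls-keys.log"   # Python 3.8+

opener = urllib.request.build_opener(urllib.request.HTTPSHandler(context=ctx))
html = opener.open("https://example.org/", timeout=30).read()
print(len(html), "bytes fetched; session secrets appended to tls-keys.log")
```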
It seems someone has done the TLS hooking: TLSNotary. Whitepaper:

‘TLSnotary’ allows a client to provide evidence to a third party auditor that certain web traffic occurred between himself and a server. The evidence is irrefutable as long as the auditor trusts the server’s public key. The remainder of this paper describes how TLSnotary allows the auditee to conduct an https session normally with a web server such that the auditor can verify some part of that session (e.g. a single HTML page), by temporarily withholding a small part of the secret data used to set up the https session. The auditee does not at any time reveal any of the session keys to the auditor or anyone else, nor does he render or decrypt any data without authentication. Thus the full security model of the TLS 1.0 session is maintained, modulo some reduction in the entropy of the secrets used to protect it. Notes to the reader: As of this writing, TLSnotary is only compatible with TLS 1.0 and 1.1, not TLS 1.2
...In summary, the purpose of this rather complex sequence of steps is: the auditor withholds some of the secret data from the auditee (acting as client), so that the auditee cannot fabricate traffic from the server (since at the time of making his request, he does not have the server mac write secret). Once the auditee has made a commitment to the encrypted content of the server’s response to his request, the auditor can provide the auditee with the required secret data in order to construct the server mac write secret. Then, the auditee can safely complete the decryption and authentication steps of the TLS protocol, since at that point he has the full master secret. In this way, the auditee maintains the full TLS security model, although he was prevented from creating a fake version of the post-handshake traffic from the server—something he is always able to do if he has the full master secret in advance.
Seems it requires an active and online auditor server (which is far from ideal), but if someone were to run such a trusted auditor, then you get your HTTPS provability and can timestamp it as before.

I think that the people who wrote the code are running a server.

I am surprised that it is possible for a browser plug-in to hook so deeply into the browser to accomplish this.
Or make your own CA, install that certificate in your browser, and MITM the connection. Probably easier than changing your browser’s code; it can easily be done all on a single system.

Knew I forgot something.
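A minimal sketch of that CA/MITM idea using mitmproxy, which generates its own CA certificate for you to trust in the browser. The dump directory is a placeholder, and this only illustrates the addon hook, not a hardened archiver.

```python
# save_bodies.py -- run as: mitmdump -s save_bodies.py
# Every response that passes through the proxy is written to disk, so nothing
# the browser downloads is lost, logged-in pages included.
import hashlib
import os
import time

from mitmproxy import http

DUMP_DIR = "proxy-archive"  # placeholder location
os.makedirs(DUMP_DIR, exist_ok=True)

def response(flow: http.HTTPFlow) -> None:
    """mitmproxy addon hook, called once per completed response."""
    name = "%d-%s" % (time.time(),
                      hashlib.sha1(flow.request.pretty_url.encode()).hexdigest()[:12])
    with open(os.path.join(DUMP_DIR, name), "wb") as f:
        f.write(flow.response.content or b"")
    with open(os.path.join(DUMP_DIR, name + ".url"), "w") as f:
        f.write(flow.request.pretty_url)
```

Because the browser itself makes the requests, logged-in pages and dynamic content pass through the proxy like everything else, which is also one answer to the caching-proxy question raised below.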
But it’s simple and robust compared to some sort of in-browser extension which saves the rendered DOM in the background.
It’s not robust for saving things like private chats, or anything else you need to be logged in to see. If I read your page correctly you’d need to do each of those manually. Unless the program can automatically take the cookies from a browser, and even then not everything gets saved. I’d want something that saved every element that was downloaded to my computer.
Also, if the page gets deleted fast, your program may miss it.
I’ve never needed to prove that. My concern is usually having a copy at all, and the IA is trusted enough that it’s a de facto proof.
I anticipate needing that, or at least finding it useful. Did you know the Internet Archive will delete based on a request by the website? I’m dealing with a specific domain where people are forging screenshots to prove their side, and something like this would come in handy, I think. Some of the sites are also deleting posts fast, which makes it hard to archive on a schedule.
find a download tool which will capture the raw bit-level TCP/IP packet stream when you download a page and save it to an appropriate format like a PCAP file (maybe Wireshark supports this or it could be hooked into something like wget?); this preserves the HTTPS encryption and allows proving that the content came signed with the domain’s key (not that this means much), since the crypto can be checked to be valid and the stream replayed.
Why doesn’t it mean much?
For timestamping, doesn’t the TCP protocol have timestamps in it, or are those not signed or something? Also, many pages have timestamps embedded in them.
We do have different uses for this, obviously.
Would a proxy be easier to set up, and if so, how would I do that to cache all results?
If there was a program that functioned like I wanted it to, would you prefer it over your solution?
It’s not robust for saving things like private chats, or anything else you need to be logged in to see. If I read your page correctly you’d need to do each of those manually.
There are always edge-cases. A simple version of my solution can be coded up and fully implemented in an hour or less by a normal programmer (the hardest part is writing the SQL line for pulling out URLs from Firefox); the full version (a bot or daemon) could probably be done in only a few hours more.
Your desired solution, on the other hand, requires intimate familiarity with browser extensions and internals (and if you want to save dynamic content and fancy things like JavaScript-based chats, so much for trying to leverage existing solutions like the Mozilla Archive Format extension!).
Pareto.
Did you know the Internet Archive will delete based on a request by the website?
My understanding is that in all cases, these deletions are really more ‘marking private’, and if it’s done via robots.txt, well, one day that robots.txt may be gone.
Some of the sites are also deleting posts fast, which makes it hard to archive on a schedule.
Note the on-demand archiving services used by my archive bot, discussed in my page...
For timestamping, doesn’t the TCP protocol have timestamps in it, or are those not signed or something?
I’m not sure. It’s possible that the packets have timestamps, but the encrypted content does not, in which case you don’t get provable timestamping: the HTTPS encryption can be verified, but one could have modified the packets to read any timestamps one pleases because they’re ‘around’ the encryption, not ‘in’ it. If the encrypted content does carry usable timestamps, then maybe you don’t need explicit trusted timestamping, but if it doesn’t (or you want to work with any other data sources which don’t luckily have timestamps built in just right) then the Bitcoin solution would work.
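A sketch of the Bitcoin timestamping step from the earlier proposal: hash the day’s tarball and commit the digest on-chain by paying a token amount to an address derived from it. (An OP_RETURN output or an aggregating timestamping service is the cheaper, cleaner variant; coins sent to a derived address like this are burned.) The tarball name is a placeholder.

```python
# Commit to a day's archive tarball: hash the file, encode a truncation of the
# digest as a standard P2PKH address, and send a token amount to that address.
# Anyone can later rehash the tarball, rederive the address, and check the
# block time of the transaction that paid it.
import hashlib

B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check(payload: bytes) -> str:
    """Standard Bitcoin Base58Check encoding (version byte + data + checksum)."""
    check = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    raw = payload + check
    n = int.from_bytes(raw, "big")
    out = ""
    while n:
        n, r = divmod(n, 58)
        out = B58[r] + out
    # leading zero bytes are encoded as leading '1' characters
    return "1" * (len(raw) - len(raw.lstrip(b"\x00"))) + out

digest = hashlib.sha256(open("2014-01-01-archive.tar.xz", "rb").read()).digest()
# a P2PKH address only carries 20 bytes, so the 32-byte digest is truncated here
address = base58check(b"\x00" + digest[:20])
print("send dust to:", address)
```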
Also, many pages have timestamps embedded in them.
Now who’s satisficing.
If there was a program that functioned like I wanted it to, would you prefer it over your solution?
I would consider it, but I would be somewhat reluctant to switch because I wouldn’t trust the tool to not break horribly at some point.
Pfft says that this wouldn’t work at all for proof due to how TLS works. Is that true? Is there no hope, then?
If correct, it’s impossible to prove that any server sent a specific piece of data, because it’s encrypted symmetrically, and the only proof you have is zero-knowledge and non-transmissible. (Assuming I’m understanding properly.)
I dunno. I’m not an expert on TLS/HTTPS—I assumed that it would be implemented the natural way (signed and encrypted) where my proposal would work. But the devil is in the details...
Bandwidth is cheap.

There are very few audiences to whom a mathematical proof is more convincing than a screenshot. And those audiences probably don’t care about you. But if you really want to do it, you’re probably better off modifying a proxy than a browser.