I’m not a fan of the (a) notation, or any link annotation for providing an archive link, since it’s not useful for the reader. An archive link is infrastructure: most readers don’t care about whether a link is an archive link or not. They might care about whether it’s a PDF or not, or whether it’s a website they like or dislike, but they don’t care whether a link works or not—the link should Just Work. ‘Silence is golden.’
Any new syntax should support either important semantics or important functionality. But this doesn’t support semantics, because it essentially doubles the overhead of every link by using up 4 characters (space, 2 parentheses, and ‘a’) without actually providing any new information, unlike regular link annotations, which cost only a single character (you can generally assume that a link is available in the Internet Archive (IA) or elsewhere if the author cares enough about the issue to do something like include (a) links!). And the functionality is one that will be rarely exercised by users, who will click on only a few links and will click on the archived version for only a small subset of those, unless link rot is a huge issue—in which case, why are you linking to the broken link at all instead of the working archived version? (Users can also implement this client-side with a small piece of JS which inserts a stub link to the IA, which is another red flag: if something is so simple and mindless that it can be done client-side at runtime, what value is it adding, really?)
So my opinion on archive links is that you should ensure that archives exist at all, you should blacklist domains which are particularly unreliable (I have a whole list of domains I distrust in my lint script, like SSRN or ResearchGate or Medium) and pre-emptively link to archived versions of them to save everyone the grief, and you should fix other links as they die. But simply mindlessly linking to archive links for all external links adds a lot of visual clutter and is quite an expense.
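Concretely, the sort of check I have in mind looks something like this minimal Python sketch (the domain list and helper names are placeholders for illustration, not the actual lint script):

```python
from urllib.parse import urlparse

# Placeholder blacklist based on the domains named above; extend as needed.
DISTRUSTED_DOMAINS = {"ssrn.com", "papers.ssrn.com", "researchgate.net", "medium.com"}

def is_distrusted(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in DISTRUSTED_DOMAINS)

def preemptive_archive_url(url: str) -> str:
    # The Wayback Machine serves its most recent snapshot at this URL form.
    return "https://web.archive.org/web/" + url

def lint(urls):
    for url in urls:
        if is_distrusted(url):
            print(f"unreliable domain: {url}")
            print(f"  suggest linking: {preemptive_archive_url(url)}")

if __name__ == "__main__":
    lint(["https://medium.com/@someone/some-post",
          "https://en.wikipedia.org/wiki/Link_rot"])
```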
One place where archive links might make sense is where you can’t or won’t update the page but you also don’t want to pre-emptively use archive links for everything. If you are writing an academic paper, say, the journal will not let you edit the PDF as links die, and you ideally want it to be as useful in a few decades as it is now; academic norms tend to frown on not linking the original URL, so you are forced to include both. (Not particularly applicable to personal websites like Zuck’s or Guzey’s or my site.)
Another idea is a more intelligent hybrid: include the archive links only on links suspected or known to be dead. For example, at compile time, links could be checked every once in a while, and broken links get an annotation; ideally the author would fix all such links, but of course in practice many broken links will be neglected indefinitely… This could also be done at runtime: when the user mouses over a link, similar to my link previews, a JS library could prefetch the page and, if there is an outright error, rewrite it to an archive link. (Even fancier: links could come with perceptual hashes, and the JS library could prefetch and check that the page looks right—this catches edge cases like paywalls, silent redirects, deletions, etc.)
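The compile-time half is simple enough to sketch (hypothetical Python using the requests library; the runtime mouse-over version would be the same logic done client-side in JS):

```python
import requests  # third-party; `pip install requests`

def link_is_broken(url: str, timeout: float = 15.0) -> bool:
    """Return True only on an outright error (HTTP >= 400 or no response).

    Soft failures (paywalls, silent redirects to a homepage) slip through;
    that is what the perceptual-hash idea above is meant to catch.
    """
    try:
        r = requests.head(url, allow_redirects=True, timeout=timeout)
        if r.status_code == 405:  # some servers reject HEAD; retry with GET
            r = requests.get(url, stream=True, timeout=timeout)
        return r.status_code >= 400
    except requests.RequestException:
        return True

def annotate_broken(urls):
    """Map each URL to itself, or to a Wayback Machine URL if it looks dead."""
    return {u: ("https://web.archive.org/web/" + u if link_is_broken(u) else u)
            for u in urls}
```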
An open question for me is whether it makes sense to not pre-emptively archive everything. This is a conclusion I have tried to avoid in favor of a more moderate strategy of being careful in linking and repairing dead links, but as time passes, I increasingly question this. I find it incredibly frustrating just how many links continuously die on gwern.net, and die in ways that are silent and do not throw errors, like redirecting PDFs to homepages. (Most egregiously, yesterday I discovered from a reader complaint that CMU’s official library repository of papers has broken all PDF links and silently redirected them to the homepage. WTF!) It is only a few percent at most (for now...), but I wish it were zero percent. I put a great deal of work into my pages, making them well-referenced, well-linked, thorough, and carefully edited down to the level of spelling errors and nice typography and link popup annotations, etc., and then random external links die in a way I can’t detect and mar the experience for readers! Maaaaan.
When I think about external links, it seems that most of them can be divided into either reliable links which will essentially never go down (Wikipedia) or static links which rarely or never change and could just as well be hosted on gwern.net (practically every PDF link ever). The former can be excluded from archiving, but why not just host all of the latter on gwern.net to begin with? It wouldn’t be hard to write a compile-time Pandoc plugin to automatically replace every external link not on a whitelist with a link to a local ArchiveBox static mirror. It might cost a good 30GB+ of space to host ~10k external links for me, but that’s not that much, prices fall a little every year, and it would save a great deal of time and frustration in the long run.
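As a rough illustration, such a plugin could be sketched as a Pandoc filter in Python using the panflute library; the whitelist and the /docs/www/ mirror layout below are assumptions for the example, with the snapshots themselves produced beforehand by a separate archiving step (ArchiveBox, SingleFile, etc.):

```python
#!/usr/bin/env python3
import hashlib
from urllib.parse import urlparse

import panflute as pf  # third-party Pandoc-filter library; `pip install panflute`

# Domains reliable enough (or too dynamic) to leave untouched.
WHITELIST = {"en.wikipedia.org", "archive.org", "web.archive.org"}

def local_mirror_path(url: str) -> str:
    # Assumed layout: /docs/www/<domain>/<hash>.html, with the snapshot itself
    # already captured there by the separate archiving step.
    digest = hashlib.sha1(url.encode()).hexdigest()[:16]
    return f"/docs/www/{urlparse(url).hostname}/{digest}.html"

def rewrite(elem, doc):
    if isinstance(elem, pf.Link) and elem.url.startswith("http"):
        host = urlparse(elem.url).hostname or ""
        if host not in WHITELIST:
            elem.url = local_mirror_path(elem.url)

def main(doc=None):
    return pf.run_filter(rewrite, doc=doc)

if __name__ == "__main__":
    main()
```

It would then run during the site build as something like pandoc page.md --filter ./archive-filter.py -o page.html, once the mirror snapshots exist.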
An open question for me is whether it makes sense to not pre-emptively archive everything.
Update: I ultimately decided to give this a try, using SingleFile to capture static snapshots.
Detailed discussion: https://www.reddit.com/r/gwern/comments/f5y8xl/new_gwernnet_feature_local_link_archivesmirrors/ It currently costs ~5300 links/20GB, which is not too bad but may motivate me to find new hosting, as I think it will substantially increase my monthly S3 bandwidth bill. The snapshots themselves look pretty good and no one has reported any serious problems yet… Too early to say, but I’m happy to be finally giving it a try.
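Ironically, the explanation of your archives mirror is now also inaccessible (fortunately, it is still available on archive.org).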
Thanks, this is great. (And I didn’t know about your Archiving URLs page!)
And the functionality is one that will be rarely exercised by users, who will click on only a few links and will click on the archived version for only a small subset of said links, unless link rot is a huge issue—in which case, why are you linking to the broken link at all instead of the working archived version?
I feel like I’m often publishing content with two audiences in mind – my present-tense audience and a future audience who may come across the post.
The original link feels important to include because it’s more helpful to the present-tense audience. e.g. Often folks update the content of a linked page in response to reactions elsewhere, and it’s good to be able to quickly point to the latest version of the link.
The archived link is more aimed at the future audience. By the time they stumble across the post, the original link will likely be broken, and there’s a better chance that the archived version will still be intact. (e.g. many of the links on Aaron Swartz’s blog are now broken; whenever I read it I find myself wishing there were convenient archived versions of the links).
Certainly there are links which are regularly updated, like Wikipedia pages. They should be whitelisted. There are others which wouldn’t make any sense to archive, stuff like services or tools—something like Waifu Labs which I link in several places wouldn’t make much sense to ‘archive’ because the entire point is to interact with the service and generate images.
But examples like blogs or LW pages make sense to archive after a particular timepoint. For example, many blogs or websites like Reddit lock comments after a set number of days. Once that’s passed, typically nothing in the page will change substantially (for the better, anyway) except to be deleted. I think most of my links to blogs are of that type.
Even on LW, where threads can be necroed at any time, how often does anyone comment on an old post, and if your archived copy happens to omit some stray recent comments, how big a deal is that? Acceptable collateral damage compared to a website where 5 or 10% of links are broken and the percentage keeps increasing with time, I’d say...
For this issue, you could implement something like a ‘first seen’ timestamp in your link database and only create the final archive & substitute it after a certain time period—I think a period like 3 months would capture 99% of the changes which are ever going to be made, while not risking exposing readers to too much linkrot.
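Sketched in Python (hypothetical helper code; the JSON store and the 90-day cutoff are only illustrative), the delay logic would be something like:

```python
import json, time
from pathlib import Path

DB_PATH = Path("link-first-seen.json")   # illustrative location for the link database
SETTLE_SECONDS = 90 * 24 * 3600          # ~3 months, per the guess above

def links_ready_to_archive(current_links):
    """Record first-seen times for new links; return those old enough to snapshot."""
    db = json.loads(DB_PATH.read_text()) if DB_PATH.exists() else {}
    now = time.time()
    for url in current_links:
        db.setdefault(url, now)
    DB_PATH.write_text(json.dumps(db, indent=2))
    return [url for url in current_links if now - db[url] >= SETTLE_SECONDS]
```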
This makes sense, but it takes a lot of activation energy. I don’t think a practice like this will spread (like even I probably won’t chunk out the time to learn how to implement it, and I care a bunch about this stuff).
Plausibly “(a)” could spread in some circles – activation energy is low and it only adds 10-20 seconds of friction per archived link.
But even “(a)” probably won’t spread far (10-20 seconds of friction per link is too much for almost everyone). Maybe there’s room for a company doing this as a service...
If adoption is your only concern, doing it website by website is hopeless in the first place. Your only choice is creating some sort of web browser plugin to do it automatically.
The script now exists: https://www.andzuck.com/projects/archivify/
Update: Brave Browser now gives an option to search for archived versions whenever it lands on a “page does not exist”.
Not my only concern but definitely seems important. (Otherwise you’re constrained by what you can personally maintain.)
A browser plugin seems like a good approach.