And what happens in twenty years when the journal goes out of business?
Do you think that that’s worse than the way things are done currently?
BitTorrent? You can publish shasums of the datasets in the paper, so you know the data you download is the data you are looking for.
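For concreteness, a minimal sketch of that check, using SHA-256 (any fixed hash the paper publishes would do); the checksum and file name here are hypothetical placeholders, the file name borrowed from the made-up example later in this thread:

    # Check a downloaded dataset against the checksum published in the paper.
    # The checksum and file name below are hypothetical placeholders.
    import hashlib

    PUBLISHED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"  # from the paper
    DATASET_PATH = "herlihy-moss-htm-1993.tar.gz"

    def sha256_of(path, chunk_size=1 << 20):
        """Hash the file in 1 MB chunks so a multi-gigabyte dataset needn't fit in RAM."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    actual = sha256_of(DATASET_PATH)
    if actual == PUBLISHED_SHA256:
        print("Checksum matches: this is the dataset the paper cites.")
    else:
        print("MISMATCH: expected %s, got %s" % (PUBLISHED_SHA256, actual))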
BitTorrent specializes in short-term, spiky mass downloading. It's not so hot for the long tail of years or decades. How many large torrents are still alive after a few years?
This is exactly the problem that archive.org was set up to deal with. They’ve been doing an excellent job of it, and their cost-per-gigabyte-month is only going to drop as storage and bandwidth become cheaper.
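As an aside, here is a rough sketch of what depositing such a dataset on archive.org could look like; the third-party internetarchive Python package, the item identifier, and the metadata values are all assumptions of mine for illustration, not anything specified in this thread:

    # Sketch: deposit a dataset tarball as an archive.org item with minimal metadata.
    # Assumes the third-party `internetarchive` package and configured IA credentials;
    # the identifier, file name, and metadata values are made-up placeholders.
    from internetarchive import upload

    responses = upload(
        "example-dataset-1993",                 # hypothetical, must be unique on archive.org
        files=["example-dataset-1993.tar.gz"],
        metadata={
            "title": "Example dataset (placeholder)",
            "mediatype": "data",
            "date": "1993",
            "licenseurl": "https://creativecommons.org/publicdomain/zero/1.0/",
        },
    )
    print([r.status_code for r in responses])

Once an item exists, its files get a stable download URL under that identifier, which is the kind of permanent address the rest of this thread is asking for.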
Yes, they have been doing an excellent job. I’ve donated to them more than once because I find myself using the IA on a nigh-daily basis.
But the IA is no panacea. It can only store some categories of content reliably, and the rest is inaccessible. Nor have I seen them hold & distribute the truly enormous datasets that much research will use—the biggest files I’ve seen the IA offer for public download are in the single gigabytes or hundreds of megabytes range.
We have the data for 20 years longer than we would have if they had never published it. And if someone else happens to care about the subject, copies will probably remain somewhere.
http://archive.org/
In twenty years the price of storage will have fallen enough that some library will have no problem storing a copy. Let PubMed store the copy for medical data.
Storage isn’t the principal cost even now. Ask some librarians: copyright clearance, format migration, metadata maintenance, bandwidth, accessibility… all of these cost much more than mere rotating disks. (Especially since a 2TB drive cost ~$100 in 2010 and ~$70 by October 2011, and prices will only keep falling.)
Let’s see if we can address each of these issues. Assume, as a baseline, that all this data gets hosted by archive.org and that storage costs are pretty much a non-issue in The Future.
Copyright clearance. I confess I’m not too familiar with copyright law, but shouldn’t it be possible for scientists to grant permission to host and distribute their datasets forever? That sounds like it should streamline things.
Format migration. This can be mitigated by using human-readable formats wherever possible. Fifty years from now people may not be able to read PDF files, but they’ll definitely be able to handle simple CSV data plus a README file describing the format. And XML is easy enough to puzzle out. (I resurrected a 20-year-old boolean logic synthesis program, and the data formats had not changed in that time. Plain text is easy, and it handles a lot.)
Metadata maintenance. The only metadata we’ll really need for these datasets is a unique ID that can be included in papers which wish to cite them. Once you have that, hosting can be as simple as sticking a bunch of .tar.gz files in a directory and setting up a web server; I could do it in ten minutes (a sketch of that minimal setup follows at the end of this comment). If you want more elaborate metadata, go right ahead, but remember that it’s the icing on the cake.
Bandwidth. Getting ever cheaper as we figure out how to push more data over the same fiber.
Accessibility. You don’t need to jump through hoops to make it accessible to the blind or something; simply having the raw data and an easy-to-use web site would be plenty accessible. Even if your web site redirected every tenth visitor to Goatse.cx, it would be better than the current situation.
We don’t have to make this perfect; we just have to build a base that works, and which can interoperate with other web sites. Just publicly hosting tarballs of the datasets, each one given a permanent unique identifier which is included in any papers citing it, would be fantastic.
http://opendatasets.archive.org/dataset/herlihy-moss-htm-1993.tar.gz
See that? It’s not a real URL, but I wish it were.
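To make the “ten minutes” claim above concrete, here is a rough sketch of that sort of minimal hosting, assuming the tarballs already sit in a local directory; the plain-text CSV manifest and the use of each file’s SHA-256 hash as its permanent identifier are my own illustration, not something prescribed in this thread:

    # Sketch: publish a directory of dataset tarballs with permanent, citable identifiers.
    # Writes a plain-text CSV manifest (sha256 id, filename, size) next to the tarballs,
    # then serves the directory over HTTP. The directory name and port are placeholders.
    import csv
    import hashlib
    import os
    from functools import partial
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    DATASET_DIR = "datasets"                    # directory full of .tar.gz files
    MANIFEST = os.path.join(DATASET_DIR, "MANIFEST.csv")

    def sha256_of(path, chunk_size=1 << 20):
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # The hash doubles as the dataset's permanent identifier; a paper can cite it directly.
    with open(MANIFEST, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["id_sha256", "filename", "bytes"])
        for name in sorted(os.listdir(DATASET_DIR)):
            if name.endswith(".tar.gz"):
                path = os.path.join(DATASET_DIR, name)
                writer.writerow([sha256_of(path), name, os.path.getsize(path)])

    # Serve the tarballs and the manifest; the handler also provides a directory index.
    handler = partial(SimpleHTTPRequestHandler, directory=DATASET_DIR)
    HTTPServer(("", 8000), handler).serve_forever()

The manifest itself is plain CSV, in keeping with the format-migration point above: anyone in fifty years can open it in a text editor.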
Scientists cannot even get their papers published under distributable or Free terms. The Open Access people struggle precisely because researchers don’t want to take the time to learn about all that and work through it, and copyright law, which defaults to all-rights-reserved, doesn’t help in the least. (This is one reason such people try to get federally-funded papers mandated to be Open Access; defaults are incredibly important.) In many cases researchers don’t have permission, or it’s too difficult to figure out who does. And publishers occasionally do nasty things like fire off DMCA takedowns in all directions (the ACM being just the latest example).
It should be possible. It often is. It’s also possible to run a marathon backwards.
Human-readable formats are the standard recommendation, incidentally. But they are not a cure-all, because data requires interpretation, and the computing world never has switched entirely to textual formats and never will. As long as old binary or hard-to-read data is around, the costs will be paid. The Daily WTF furnishes plenty of evidence that even textual formats can require reverse-engineering (to say nothing of scientists’ reputation for bad coding).
“I have a problem,” someone says; “I know, I’ll use a Globally Unique ID…” An ID is perhaps the simplest possible solution, but it only works if you never need to do anything besides answer the question “is this dataset I’m looking at the one whose ID I know?” You don’t get search, you don’t get history, or descriptions, or locations, or anything. One could be in the position of the would-be BitTorrent leech: one has the hashes and the .torrent file one needs, but there don’t seem to be any seeds...
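To illustrate what a bare ID leaves out, a small sketch of the kind of record one would want alongside it; every field and value here is made up:

    # Sketch: a bare identifier versus a findable dataset record.
    # All values are hypothetical; the point is that search, descriptions, and locations
    # need fields of their own, which a naked hash or GUID does not provide.
    bare_id = "4e5fa2c9b2f1d0a37c6e8b19f3a0d5e6c7b8a9d0e1f2a3b4c5d6e7f8091a2b3c"

    dataset_record = {
        "id_sha256": bare_id,                  # answers only "is this file the one I mean?"
        "title": "Example expression dataset (placeholder)",
        "description": "Raw and normalized readings for 40 samples; see the README inside.",
        "year": 1993,
        "size_bytes": 2147483648,
        "formats": ["CSV", "README"],
        "cited_by": ["doi:10.1000/placeholder"],   # hypothetical DOI
        "locations": [                             # where to actually fetch a copy
            "http://opendatasets.archive.org/dataset/example.tar.gz",
            "http://mirror.example.edu/datasets/example.tar.gz",
        ],
    }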
I didn’t mean accessibility in the sense of catering to the blind (although that is an issue, and one which textual formats alleviate). I meant community issues: it needs to be publicly online, well-known, well-used, easily searched or found, and zero-friction to use. It cannot be Citizendium; it must be Wikipedia. It cannot be like the obscure Open Access databases libraries try to maintain; it must be like arXiv. There are scads of archive sites and libraries and whatnot; no one uses them because it’s too hard to remember which one to use when. Archive services benefit heavily from network effects.