Let’s see if we can address each of these issues. Assume, as a baseline, that all this data gets hosted by archive.org and that storage costs are pretty much a non-issue in The Future.
Copyright clearance. I confess I’m not too familiar with copyright law, but shouldn’t it be possible for scientists to grant permission to host and distribute their datasets forever? That sounds like it should streamline things.
Format migration. This can be mitigated by using human-readable formats wherever possible. Fifty years from now people may not be able to read PDF files, but they’ll definitely be able to handle simple CSV data plus a README file describing the format. And XML is easy enough to puzzle out. (I resurrected a 20-year-old boolean logic synthesis program, and the data formats had not changed in that time. Plain text is easy, and it handles a lot.)
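To make the ‘CSV plus a README’ point concrete, here is a minimal sketch; the file name and column names are hypothetical, not taken from any particular dataset. Nothing beyond a standard library is needed to read such a file, today or in fifty years:

```python
import csv

# Hypothetical layout, for illustration only: measurements.csv, whose columns
# ("trial,temperature_K,pressure_Pa") are documented in an accompanying README.
with open("measurements.csv", newline="") as f:
    rows = list(csv.DictReader(f))   # the header row names the columns

# Everything arrives as strings; the README says which columns are numeric.
temps = [float(row["temperature_K"]) for row in rows]
print(f"{len(rows)} rows, mean temperature {sum(temps) / len(temps):.2f} K")
```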
Metadata maintenance. The only metadata we’ll really need for these is a unique id that can be included in papers which wish to cite the dataset. Once you have that, hosting can be as simple as sticking a bunch of .tar.gz files in a directory and setting up a web server. I could do it in ten minutes. If you want more elaborate metadata, go right ahead, but remember that it’s the icing on the cake.
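As a rough sketch of what the ‘ten minutes’ version might look like, assuming a local datasets/ directory of tarballs (the directory name, port, and index layout are arbitrary choices, not anything prescribed above):

```python
# Minimal sketch of "a bunch of .tar.gz files in a directory plus a web server".
# The directory name and port are assumptions made for illustration.
import functools
import http.server
import socketserver
from pathlib import Path

DATASET_DIR = Path("datasets")   # e.g. contains herlihy-moss-htm-1993.tar.gz

# Write a bare-bones index page so every tarball gets a stable link.
links = "\n".join(
    f'<li><a href="{p.name}">{p.name}</a></li>'
    for p in sorted(DATASET_DIR.glob("*.tar.gz"))
)
(DATASET_DIR / "index.html").write_text(f"<ul>\n{links}\n</ul>\n")

# Serve the directory over HTTP (the `directory` argument needs Python 3.7+).
Handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                            directory=str(DATASET_DIR))
with socketserver.TCPServer(("", 8000), Handler) as httpd:
    httpd.serve_forever()
```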
Bandwidth. Getting ever cheaper as we figure out how to push more data over the same fiber.
Accessibility. You don’t need to jump through hoops to make it accessible to the blind or something; simply having the raw data and an easy-to-use web site would be plenty accessible. Even if your web site redirected every tenth visitor to Goatse.cx, it would be better than the current situation.
We don’t have to make this perfect; we just have to build a base that works and can interoperate with other web sites. Just publicly hosting tarballs of the datasets, each one given a permanent unique identifier which is included in any paper citing it, would be fantastic.

http://opendatasets.archive.org/dataset/herlihy-moss-htm-1993.tar.gz

See that? It’s not a real URL, but I wish it were.
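The identifier scheme isn’t specified anywhere above, but one simple option is to derive the ID from the tarball’s own bytes, so no registry has to coordinate anything. The sketch below assumes that hash-based scheme, and the printed URL pattern is as imaginary as the one above:

```python
# Sketch: derive a dataset's permanent identifier from its contents.
# The hash-based scheme and the URL pattern are assumptions, not a standard.
import hashlib
from pathlib import Path

def dataset_id(tarball: Path) -> str:
    """Return a content-derived identifier for a dataset tarball."""
    h = hashlib.sha256()
    with tarball.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:16]   # truncated to stay citation-friendly

print("http://opendatasets.archive.org/dataset/"
      + dataset_id(Path("herlihy-moss-htm-1993.tar.gz")))
```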
I confess I’m not too familiar with copyright law, but shouldn’t it be possible for scientists to grant permission to host and distribute their datasets forever?
Scientists cannot even get their papers published under distributable or Free terms. The Open Access people have trouble precisely because researchers don’t want to take the time to learn about all that and work through it, and copyright law, which defaults to all-rights-reserved, doesn’t help in the least. (This is one reason such people try to get federally supported papers mandated to be open access; defaults are incredibly important.) In many cases, scientists don’t have permission, or it’s too difficult to figure out who does. And publishers occasionally do nasty things like fire off DMCA takedowns in all directions (the ACM being just the latest example).
It should be possible. It often is. It’s also possible to run a marathon backwards.
This can be mitigated by using human-readable formats wherever possible.
That’s the standard recommendation, incidentally. But it is not a cure-all, because data requires interpretation, and the computing world never has switched entirely to textual formats and never will. As long as old binary or hard-to-read data is around, the costs will be paid. The Daily WTF also furnishes evidence that even textual formats can require reverse-engineering (to say nothing of the reputation scientists have for bad coding).
The only metadata we’ll really need for these is a unique id that can be included in papers which wish to cite the dataset.
“I have a problem,” someone says; “I know, I’ll use a Globally Unique ID…” An ID is perhaps the simplest possible solution, but it only works if you never need to do anything besides answer the question ‘is this dataset I’m looking at the one whose ID I know?’ You don’t get search, you don’t get history, or descriptions, or locations, or anything. One could wind up in the position of the would-be BitTorrent leech: one has the hashes and the .torrent one needs, but there don’t seem to be any seeds...
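To illustrate just how narrow that question is: assuming the ID is a content hash as in the earlier sketch (the filename and cited ID below are made up), the entirety of what a bare ID buys you is a yes/no check on a copy you have somehow already obtained:

```python
# The one operation a bare ID supports: "is this file the dataset whose ID
# I know?" No search, no description, no location.
import hashlib
from pathlib import Path

def matches(tarball: Path, cited_id: str) -> bool:
    digest = hashlib.sha256(tarball.read_bytes()).hexdigest()
    return digest.startswith(cited_id)

print(matches(Path("herlihy-moss-htm-1993.tar.gz"), "a3f9c2d1e0b47788"))
```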
I didn’t mean accessibility in the sense of catering to the blind (although that is an issue, and one that textual formats alleviate). I meant something more like community accessibility: it needs to be publicly online, well-known, well-used, easily searched or found, and frictionless to use. It cannot be Citizendium; it must be Wikipedia. It cannot be like the obscure Open Access databases libraries try to maintain; it must be like arXiv. There are scads of archive sites and libraries and whatnot; no one uses them because it’s too hard to remember which one to use when. Archive services benefit heavily from network effects.