Regarding frequentists’ concerns about subjectivity in the Bayesian interpretation:
We should learn to be content with happiness instead of “true happiness”, truth instead of “ultimate truth”, purpose instead of “transcendental purpose”, and morality instead of “objective morality”. [1]
...and randomness instead of “true randomness”. Mind the mind.
“In modern times there is absolutely no excuse for not publishing the raw data, but that’s another story.”
Nope. If, like many studies, you have data on real live humans, there are perfectly sane ethical and legal considerations which make publishing raw data a non-starter. Even publishing summaries is a problem; see the way the Wellcome Trust Case Control Consortium recently pulled aggregate results off their website when it became clear that individuals’ disease status could be ascertained from them.
Fair point.
There can also be difficulties with sheer data size. Your average journal is not going to publish tables containing a few hundred GB of data; and while you can certainly link to a place to download the files, how long will they sit there? It would be rather embarrassing if someone read your paper in ten years and your data server was no longer there.
The notion of changing your mind about the experimental procedure, and thereby changing the significance of the result, is a bit of a straw man. You establish the experimental procedure, then you run the test; your state of mind at the time you flipped the coin is a perfectly ordinary fact about the world, which can influence your priors in a nicely Bayesian way. Of course it’s possible to cheat and lie about what your state of mind actually was, but that’s not a problem of frequentist mathematics.
Publish online. Let the journals maintain redundant servers. Keep hard copies for back-up just in case. This is really simple.
And what happens in twenty years when the journal goes out of business?
Do you think that that’s worse than the way things are done currently?
BitTorrent? You can publish SHA checksums of the data sets in the paper, so you know the data you download is the data you are looking for.
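For concreteness, here is a minimal sketch of how such a checksum could be computed, using Python’s standard-library hashlib (the archive name is hypothetical); the resulting hex digest is what the paper would print:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical dataset archive; the printed hex string goes in the paper.
print(sha256_of("trial-2010-raw.tar.gz"))
```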
BitTorrent specializes in short-term, spiky mass downloading. It’s not so hot for the long tail of years or decades. How many large torrents are still alive after a few years?
This is exactly the problem that archive.org was set up to deal with. They’ve been doing an excellent job of it, and their cost-per-gigabyte-month is only going to drop as storage and bandwidth become cheaper.
Yes, they have been doing an excellent job. I’ve donated to them more than once because I find myself using the IA on a nigh-daily basis.
But the IA is no panacea. It can only store some categories of content reliably, and the rest is inaccessible. Nor have I seen them hold & distribute the truly enormous datasets that much research will use—the biggest files I’ve seen the IA offer for public download are in the single gigabytes or hundreds of megabytes range.
We have the data for 20 years longer than we would have if they had never published it. And if someone else happens to care about the subject, we will probably have copies remaining somewhere.
http://archive.org/
In twenty years the price of storage will have fallen enough that some library will have no problem storing a copy. Let PubMed store the copy for medical data.
Storage isn’t the principal cost even now. Ask some librarians: copyright clearance, format migration, metadata maintenance, bandwidth, accessibility… all of these cost much more than mere rotating disks. (Especially since 2 TB cost ~$100 in 2010, and costs will only go down; the same capacity was down to $70 by October 2011.)
Let’s see if we can address each of these issues. Assume, as a baseline, that all this data gets hosted by archive.org and that storage costs are pretty much a non-issue in The Future.
Copyright clearance. I confess I’m not too familiar with copyright law, but shouldn’t it be possible for scientists to grant permission to host and distribute their datasets forever? That sounds like it should streamline things.
Format migration. This can be mitigated by using human-readable formats wherever possible. Fifty years from now people may not be able to read PDF files, but they’ll definitely be able to handle simple CSV data plus a README file describing the format. And XML is easy enough to puzzle out. (I resurrected a 20-year-old boolean logic synthesis program, and the data formats had not changed in that time. Plain text is easy, and it handles a lot.)
Metadata maintenance. The only metadata we’ll really need for these is a unique id that can be included in papers which wish to cite the dataset. Once you have that, hosting can be as simple as sticking a bunch of .tar.gz files in a directory and setting up a web server. I could do it in ten minutes. If you want more elaborate metadata, go right ahead, but remember that it’s the icing on the cake.
Bandwidth. Getting ever cheaper as we figure out how to push more data over the same fiber.
Accessibility. You don’t need to jump through hoops to make it accessible to the blind or something; simply having the raw data and an easy-to-use web site would be plenty accessible. Even if your web site redirected every tenth visitor to Goatse.cx, it would be better than the current situation.
We don’t have to make this perfect; we just have to build a base that works and can interoperate with other web sites. Just publicly hosting tarballs of the datasets, each one given a permanent unique identifier which is included in any papers citing it, would be fantastic (see the sketch below).
http://opendatasets.archive.org/dataset/herlihy-moss-htm-1993.tar.gz
See that? It’s not a real URL, but I wish it were.
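To make the “tarballs plus a unique id” idea concrete, here is a rough sketch, not a finished design; the directory name and index layout are made up. It hashes every archive in a directory and writes a plain-text CSV index, after which any static web server can do the hosting:

```python
import csv
import hashlib
from pathlib import Path

DATA_DIR = Path("datasets")        # hypothetical directory full of .tar.gz archives
INDEX = DATA_DIR / "index.csv"     # human-readable index: id, filename, size

def sha256_of(path):
    """SHA-256 of a file, read in 1 MB chunks; the hash doubles as the permanent id."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

with INDEX.open("w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["sha256", "filename", "bytes"])
    for archive in sorted(DATA_DIR.glob("*.tar.gz")):
        writer.writerow([sha256_of(archive), archive.name, archive.stat().st_size])

# Hosting is then just a static file server, e.g.:
#   python -m http.server --directory datasets
```

A paper would then cite the dataset by its sha256 id, and the CSV index stays readable even if every other tool bit-rots.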
“I confess I’m not too familiar with copyright law, but shouldn’t it be possible for scientists to grant permission to host and distribute their datasets forever?”
Scientists cannot even get their papers published under distributable or Free terms. The Open Access people have issues precisely because researchers don’t want to take the time to learn about all that and work through it, and copyright law, defaulting as it does to all-rights-reserved, doesn’t help in the least. (This is one reason such people try to get federally supported papers mandated to be open access; defaults are incredibly important.) In many cases researchers don’t have permission, or it’s too difficult to figure out who has permission. And publishers occasionally do nasty things like fire off DMCA takedowns in all directions. (The ACM being just the latest example.)
It should be possible. It often is. It’s also possible to run a marathon backwards.
“This can be mitigated by using human-readable formats wherever possible.”
The standard recommendation, incidentally. But this is not a cure-all, because data requires interpretation, and the entire computing world never has and never will switch entirely to textual formats. As long as old binary or hard-to-read data is around, the costs will be paid. The Daily WTF also furnishes evidence that even textual formats can require reverse-engineering (to say nothing of the reputation scientists have for bad coding).
“The only metadata we’ll really need for these is a unique id that can be included in papers which wish to cite the dataset.”
I have a problem, someone says; I know, I’ll use a Globally Unique ID… An ID is perhaps the simplest possible solution, but it only works if you never need to do anything besides answer the question ‘is this dataset I’m looking at the one whose ID I know?’ You don’t get search, you don’t get history, or descriptions, or locations, or anything. One could be in the position of the would-be BitTorrent leech: one has the hashes and the .torrent one needs, but there don’t seem to be any seeds...
I didn’t mean accessibility in the sense of catering to the blind (although that is an issue, which textual formats alleviate). I meant more along the lines of community issues: it needs to be publicly online, well-known, well-used, easily searched or found, and to have zero friction for use. It cannot be Citizendium; it must be Wikipedia. It cannot be like the obscure Open Access databases libraries try to maintain; it must be like arXiv. There are scads of archive sites and libraries and whatnot; no one uses them because it’s too hard to remember which one to use when. Archive services benefit heavily from network effects.
How do you think it influences the priors?
If you intend to flip the coin six times, then your null-hypothesis prior is “I will get 0 heads with probability 0.5^6, 1 head with probability 6*0.5^6, and so on”. If you intend to flip until you get a tail, the prior is “Probability 0.5 of one flip, 0.25 of two flips”, and so on.
That’s the likelihood under p = 0.5, not the prior. We want to infer something about p, so the prior is a distribution on p, not on the data.
Sorry, I was confused. Let me try to rephrase. Given some prior, your state of mind before the experiment affects your prediction of the outcome probabilities, and therefore informs your evaluation of the evidence. I should perhaps have said “affects the posterior” rather than “the prior”.
The exact example you’ve given (binomial versus negative binomial sampling distribution) is actually a counterexample to the above assertion. Those two distributions have the same likelihood function up to a constant factor that doesn’t involve the parameter, so the evaluation of the evidence is the same under both scenarios. It’s true that the prior predictive distributions are different, but that doesn’t affect the posterior distribution of the parameter.
Really? I find that counterintuitive; could you show me the calculation?
Suppose that there are two sampling distributions that satisfy (sorry about the lousy math notation) the proportionality relationship,
Pr1(data | parameter) = k * Pr2(data | parameter)
where k may depend on the data but not on the parameter. Then the same proportionality relationship holds for the prior predictive distributions,
Pr1(data) = Integral { Pr1(data | parameter) Pr(parameter) d(parameter) }
Pr1(data) = Integral { k Pr2(data | parameter) Pr(parameter) d(parameter) }
Pr1(data) = k Integral { Pr2(data | parameter) Pr(parameter) d(parameter) }
Pr1(data) = k Pr2(data)
Now write out Bayes’ theorem:
Pr(parameter | data) = Pr(parameter) Pr1(data | parameter) / Pr1(data)
Pr(parameter | data) = Pr(parameter) k Pr2(data | parameter) / (k Pr2(data) )
Pr(parameter | data) = Pr(parameter) Pr2(data | parameter) / Pr2(data)
So it doesn’t matter whether the data were sampled according to Pr1 or Pr2. You can check that the binomial and negative binomial distributions satisfy the proportionality condition by looking them up in Wikipedia.
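Or, rather than Wikipedia, here is a quick numerical sanity check, assuming Python with numpy and scipy. Take a concrete outcome consistent with both stopping rules (5 heads and then a tail in 6 flips): the two likelihoods differ only by the constant C(6,5) = 6, so the normalized posteriors under a shared prior (uniform on a grid, purely as an illustration) coincide:

```python
import numpy as np
from scipy import stats

p = np.linspace(0.01, 0.99, 99)   # grid of candidate values for Pr(heads)

# Stopping rule 1: "flip 6 times", and we happened to see 5 heads -> binomial.
lik_binom = stats.binom.pmf(5, 6, p)

# Stopping rule 2: "flip until the first tail", and we saw 5 heads first
# -> negative binomial: 5 "failures" (heads) before the 1st "success" (tail),
#    where scipy's success probability is 1 - p.
lik_nbinom = stats.nbinom.pmf(5, 1, 1 - p)

# The ratio is the constant C(6,5) = 6, independent of p...
print(np.allclose(lik_binom / lik_nbinom, 6.0))   # True

# ...so under a common prior (here uniform on the grid) the normalized
# posteriors are identical.
post1 = lik_binom / lik_binom.sum()
post2 = lik_nbinom / lik_nbinom.sum()
print(np.allclose(post1, post2))                  # True
```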
Your argument is convincing; I sit corrected.