Free (old) scientific papers [Link]
Greg Maxwell is torrenting 33 GiB of public domain JSTOR documents that were behind paywalls.
What’s your take on this, ethically, legally, etc?
ETA: More on this: http://gigaom.com/2011/07/21/pirate-bay-jstor/
If we are discussing ethics, it might be interesting to have some details on JSTOR’s finances. As a non-profit charity, such figures may inform our judgements...
I did one of my little Form 990 readthroughs on wikien-l: http://lists.wikimedia.org/pipermail/wikien-l/2011-July/109234.html
EDIT: librarian Andrew Gray did the same thing with an earlier filing; noticed some things I didn’t, like how few articles they sell to the general public: http://www.generalist.org.uk/blog/2011/jstor-where-does-your-money-go/
Ethically I’d say this is pretty well in the clear. The original printed versions of the papers Greg Maxwell’s released should all be in the public domain (and virtually none of their authors are still alive, so they’re hardly losing out). The material seems to already be publicly available through Google Books; all GM’s done is present the data without a clunky search interface. I can’t see this hurting JSTOR’s bottom line either, since all the papers come from just one journal, and who’s going to cancel their subscription because one journal became more accessible?
I’m more interested in whether it’ll ever be feasible to release more than one journal at a time. This Philosophical Transactions torrent is 34.9GB alone. Someone released a torrent of Science a couple of years ago that went over 100GB. Most journals would have a pre-1923 back catalogue smaller than those (PT and Science are both very old, and the Science torrent includes post-1922 issues) but collectively it’d still be a huge amount of material. That said, the ballpark numbers I get are less than I expected.
JSTOR has 777,061 pre-1923 items published in journals, and those items total 3,916,062 pages. All of the different pre-1923 incarnations of PT rack up 158,644 pages, 4% of the pre-1923 JSTOR total. So if it had the same page-to-file-size ratio as PT, pre-1923 JSTOR would take up 862GB. That’s big, but small enough to copy onto a $60 hard disk and share by BitTorrent. All that’s missing is someone able to scrape that much data from JSTOR and set up the public torrent for it. (According to the indictment, Aaron Swartz got “well over 4,000,000 articles from JSTOR”, which would be most of the pre-1923 articles, but I’ve seen nothing else to suggest he was going to host them all for public use. It is interesting that he managed to download so much of the database before getting rumbled.)
[Edited to clarify that I’m talking about pre-1923 JSTOR material.]
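For anyone who wants to check the back-of-envelope estimate above, here is a minimal sketch in Python. The page counts and the 34.9GB torrent size are the figures quoted in the comment; the function name and the assumption that file size scales linearly with page count are only illustrative, not anything JSTOR publishes.

```python
# Figures quoted in the comment above.
PT_PAGES = 158_644               # pre-1923 Philosophical Transactions pages
JSTOR_PRE1923_PAGES = 3_916_062  # all pre-1923 JSTOR journal pages
PT_TORRENT_GB = 34.9             # size of the Philosophical Transactions torrent


def estimate_total_gb(sample_pages, sample_gb, total_pages):
    """Scale a sample's size up to the whole collection.

    Assumes (hypothetically) that every journal has the same
    page-to-file-size ratio as the sample.
    """
    return sample_gb * total_pages / sample_pages


share = PT_PAGES / JSTOR_PRE1923_PAGES
estimate = estimate_total_gb(PT_PAGES, PT_TORRENT_GB, JSTOR_PRE1923_PAGES)

print(f"PT share of pre-1923 pages: {share:.1%}")
print(f"Estimated pre-1923 JSTOR size: {estimate:.0f} GB")
```

With these inputs the estimate lands within about a gigabyte of the 862GB figure; the small gap presumably comes from rounding in the quoted torrent size. Scan resolution and compression vary between journals, so treat the result as a rough order of magnitude.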
Practically, this is available on Google Books. It lacks the metadata, but Google is good at searching.
Ethically, it is reprehensible to imply that Aaron Swartz was intent on piracy. (Does anyone have evidence that he was? ETA: He seems to have endorsed “guerrilla open access” in 2008; and his work on PACER is similar, though more clear-cut than this torrent. But he has also used datasets in other ways.)
Legally, it is clear-cut: digitizing a book creates a new document under copyright. Google and Gutenberg distribute these documents for free, but only for noncommercial use. ETA: this is not true about Gutenberg; it is not clear about Google.
Incidentally, the Royal Society has its own archive; this is not from JSTOR. ETA: no, this is from JSTOR.
What do you base this on? There is a comment on Hacker News stating the opposite.
I’ll concede that it is at least not clear. In any event, I was wrong about Project Gutenberg and Google’s position is not entirely clear to me. It is worth noting that the Royal Society is in the UK, where the law is definitely different, if not clear.
The Royal Society’s archive looks just as walled up as JSTOR’s. I picked out an arbitrary volume from 1693 and tried to get the full-text PDF for a paper, but it asks for a login. I don’t know if anyone can sign up for free or has to buy an access package. In any case, it seems unnecessary to make people jump through hoops for papers that are (in their original form) public domain.