Somewhat off-topic, and answering an ancient comment, but a useful reminder of how important endeavours can be horribly short of resources and much more fragile than people think:
Wikipedia would better serve its public, IMO, by scrapping elections and making it as easy as possible for groups to fork Wikipedia. Putting the content under a permissive, open-source license was a major step in that direction. The two major remaining steps, IMO, are a technical provision by which every competing encyclopedia’s software may be notified of every change to every Wikipedia page as soon as the change is saved, and the development of search engines able to lead the surfer through the bewildering array of world views and editorial approaches nurtured by the governance structure I just described.
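A hypothetical sketch of what consuming such a per-change notification feed could look like, modelled on Wikimedia’s public recent-changes event stream; the URL and JSON field names are assumptions for illustration, not something the comment above specifies:

    # Sketch only: follow every saved change as a stream of events.
    # Endpoint and field names are assumptions (Server-Sent Events framing).
    import json
    import requests

    STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

    with requests.get(STREAM_URL, stream=True) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            # Payload lines look like "data: {...json...}"; skip keep-alives.
            if not line or not line.startswith("data:"):
                continue
            change = json.loads(line[len("data:"):])
            if change.get("wiki") == "enwiki":  # only English Wikipedia edits
                print(change.get("title"), change.get("type"))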
We know this :-) Wikipedia as monopoly provider of the world’s encyclopedia is an anti-pattern. But the network effects are very powerful.
This means we are a great big single point of failure. Our single data centre is one hurricane away from disappearing. Even making a good backup of English Wikipedia is a remarkably difficult endeavour because it’s SO BIG. A billion and a half words. Can you mentally grasp how big that is? I sure can’t.
And the distributed network you outline would be a wonderful thing. But, like most things that it would be nice to do with Wikipedia, it requires coding on MediaWiki. Lots of people have “Why don’t you …” technical ideas—nearly none of them follow them with the requisite code.
The budget for this year includes a pile of cash for technical resources: a second data centre and a lot more coders. We’re also developing a pattern where young whizzkids work for WMF for a couple of years at charity pay and then go off to make a bundle in industry, and that’s fine by us.
1.5N GB, where N is the average number of bytes per English word. Multiply by, say, 5 for HTML overhead and it would still all fit onto a 64 GB memory stick uncompressed, though I’d want something faster for actually accessing it.
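A quick back-of-envelope version of that estimate; the 6 bytes per word and the 5x markup factor are illustrative assumptions, not measured figures:

    # Rough size of 1.5 billion words of article text.
    words = 1.5e9           # word count quoted above
    bytes_per_word = 6      # assumed: ~5 letters plus a space
    markup_overhead = 5     # assumed multiplier for HTML around the text

    plain_gb = words * bytes_per_word / 1e9
    marked_up_gb = plain_gb * markup_overhead
    print(plain_gb, marked_up_gb)  # 9.0 and 45.0: comfortably under 64 GB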
It would actually be larger, as you’d need all the images as well, and you’d want the ancillary things like Wikisource and Wiktionary (I don’t know if those are independent projects or if they’re included in your figure), but even so, it sounds like the whole thing would easily fit onto a typical hard disc.
I have all of English Wikipedia available for offline searching on my phone. It’s big, sure, but it doesn’t fill the memory card by any means (and this is just the default one that came with the phone).
For offline access on a Windows computer, WikiTaxi is a reasonable solution.
I’d recommend that everyone who can should carry around an offline version of Wikipedia. I consider it part of my disaster preparedness, not to mention the fun of learning new things by hitting the ‘random article’ button.
No. You or I can say the numbers. But can you mentally grasp how much text that is? I doubt it.
Oh, and English Wikipedia is now being written faster than anyone could possibly read it.
In the context you started out talking about—making a backup—mentally grasping how much data that is as text seems far less relevant than mentally grasping how much data that is as a fraction of the storage capacity of a phone, or grasping it as an amount of time required to transfer it from one network location to another.
It sounds like you’ve switched contexts along the way, though I’m not really sure to what.
Yeah, I went off on a sidetrack of expressing how flabbergasted I am at the size of the thing. Sorry about that.
It’s roughly as many words as are spoken worldwide in 2.5 seconds, assuming 7450 words per person per day. It’s very probably less than the number of English words spoken in a minute. It’s also about the number of words you can expect to speak in 550 years. That means there might be people alive who’ve spoken that many words, given the variance of word-production counts.
So, a near inconceivable quantity for one person, but a minute fraction of total human communication.
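The arithmetic behind those comparisons, using the 7450-words-per-day figure from the comment and an assumed world population of 7 billion:

    # Spoken-word comparisons for a 1.5-billion-word corpus.
    corpus_words = 1.5e9
    words_per_person_per_day = 7450   # figure used above
    world_population = 7e9            # assumed round number

    world_words_per_second = world_population * words_per_person_per_day / 86400
    seconds_worldwide = corpus_words / world_words_per_second
    years_for_one_speaker = corpus_words / (words_per_person_per_day * 365)
    print(round(seconds_worldwide, 1), round(years_for_one_speaker))  # 2.5 and 552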
Note that the 1.5 billion words of current text isn’t what really makes it so large. The real issue is the sheer number of revisions, which increases the database size by orders of magnitude. The large number of images also contributes.
Yeah, it’s the full history dump that basically hasn’t worked properly in years.
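To put rough numbers on why the full history dwarfs the current text, a purely illustrative calculation; the revision count and average revision size are assumed for the sake of the example, not taken from dump statistics:

    # Illustrative only: revisions, not current text, dominate the size.
    current_text_gb = 9            # the ~9 GB plain-text figure from above
    revisions = 500e6              # assumed total number of revisions
    avg_revision_bytes = 3000      # assumed: each revision stores full page text

    history_gb = revisions * avg_revision_bytes / 1e9
    print(history_gb, history_gb / current_text_gb)  # 1500 GB, roughly 170x the current text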