Time to rebuild a library
My 5 terabyte hard drive went poof this morning, and silly me hadn’t bought data-recovery insurance. Fortunately, I still have other copies of all my important data, and it’ll just take a while to download everything else I’d been collecting.
Which brings up the question: What info do you feel it’s important to have offline copies of, gathered from the whole gosh-dang internet? A recent copy of Wikipedia and the Project Gutenberg DVD are the obvious starting places… which other info do you think pays the rent of its storage space?
I don’t see much value in having a recent copy of Wikipedia or Project Gutenberg on my computer. In both cases the availability of the information is secured by other parties. It’s more valuable to make sure that I store information that’s not protected by other people.
Someone Is Learning How to Take Down the Internet. What will you do when the only data you have access to is whatever you have stored locally?
Look lovingly at my store of beans & ammo :-P
You do realize that if the whole ’net goes down for more than a few hours, lack of access to Wikipedia is not going to be your most pressing problem..?
I have found wikipedia and other locally saved content interesting to read when the internet was down for long periods of time. It’s weird to lose the modern ability we take for granted, that we can just look anything up whenever we are curious.
If the world does collapse, access to Wikipedia could be enormously useful. Imagine needing to look up what plants are edible, or how to hunt, or how long to wait before nuclear fallout disperses, etc.
What makes you think you’ll have electricity in a TEOTWAWKI scenario? I’ll still take beans & ammo (and maybe a paper survivalist book).
On a more general level, if you want to prepare for civilizational collapse, downloading Wikipedia to your local hard drive is probably not the right place to start.
Who says that’s where I’m starting? :)
I already have my short-term physical supplies, including water, food, camping gear, and AA-battery-powerable handheld ham radio. I also have a highly-portable solar panel capable of keeping my phone, and the offline copy of Wikipedia I keep on its SD Card, functioning regardless of the power grid; and I have enough battery-backup stuff at home to run my laptop long enough to copy the latest Wikipedia dump (and whatever emergency-survival ebooks I’ve collected by then) onto that SD card.
Water, tinned food and ammo (if you live somewhere where firearms are legal) are probably the most important, but wrapping some electronic gadgets in tin foil (would that shield them from EMP blasts?) and buying some solar panels or a generator could be pretty useful too. For instance, a radio would be very useful for listening to the army trying to organise survivors.
I have a generator and a printer to print any pages I need. It might be worth looking into a low-power device for reading text that needs minimal batteries, or maybe a solar panel.
Data recovery is a last-ditch effort that as often as not fails, and if it succeeds will only get you back kilobytes or megabytes of your most critical material. (Unless you’re lucky enough that it’s actually a controller failure.)
If you want to avoid disk failures, invest instead in backups.
I have two hard drives, one larger than the other, with the smaller being backed up to the larger. When the smaller drive is filled, or when it fails, I simply buy a new drive, still larger, and the previous larger becomes the new smaller. So far there is no end in sight to this process.
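(For the mirroring step itself, something like the following works; a rough sketch, with made-up mount points, assuming rsync is installed.)

    # Minimal sketch: mirror the smaller drive onto the larger one.
    # The mount points below are placeholders; adjust to your setup.
    import subprocess

    SOURCE = "/mnt/small_drive/"          # trailing slash: copy contents, not the directory itself
    BACKUP = "/mnt/large_drive/backup/"

    # -a preserves permissions/timestamps, --delete drops files removed from the source
    subprocess.run(["rsync", "-a", "--delete", SOURCE, BACKUP], check=True)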
I also have everything in an encrypted online backup with CrashPlan.
Depends how much storage space you are willing to buy.
One of my fantasies is a Raspberry Pi that automatically downloads all Wikipedia updates each month or so, to keep a local copy. The ultimate version of this would do the same for every new academic article available on Sci-Hub.
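Something like the following, run monthly from cron, would be the core of it (a rough, untested sketch; the URL is the usual location of the latest English dump, and the filename is a placeholder):

    # Sketch for the Raspberry Pi: fetch the latest English Wikipedia dump.
    # Schedule it monthly from cron, e.g.:  0 3 1 * *  python3 fetch_wikipedia.py
    # The URL below is the usual "latest" dump location; adjust language/flavour as needed.
    import urllib.request

    DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
                "enwiki-latest-pages-articles.xml.bz2")

    # The compressed dump is large (tens of GB), so make sure the target disk has room.
    urllib.request.urlretrieve(DUMP_URL, "enwiki-latest-pages-articles.xml.bz2")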
Sci-Hub is the largest collection of scientific papers on the planet, and has over 58 million academic papers. If they average 100 kB apiece, that’s only 5.8 TB. If they average 1 MB each, then you would need to shell out some decent cash, but you could in theory download all available academic papers.
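The arithmetic, for reference:

    # Back-of-the-envelope storage estimate for ~58 million papers
    # at two assumed average file sizes.
    papers = 58_000_000
    for avg_bytes in (100 * 10**3, 1 * 10**6):        # 100 kB vs 1 MB per paper
        total_tb = papers * avg_bytes / 10**12
        print(f"{avg_bytes // 1000} kB average -> {total_tb:.1f} TB")
    # 100 kB average -> 5.8 TB
    # 1000 kB average -> 58.0 TB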
Someone may even have already done something like this, and put the script on GitHub or somewhere. (I haven’t looked.)
(Also, nice username. :) )
EDIT: It turns out there’s a custom-built app for downloading and viewing Wikipedia in various languages. It’s available on PCs and Android phones, and there’s already a version made specially for the Pi: http://xowa.org/home/wiki/Help/Download_XOWA.html
I wonder how difficult it would be to translate all of Sci-Hub into a wiki format that the app could import and read. You’d probably have to modify the app slightly, in order to divide up all the Sci-Hub articles among multiple hard drives. It might make the in-app search feature take forever, for instance. And obviously it wouldn’t work for the Android app, since there’s not enough space on a MicroSD card. (Although, maybe a smaller version could be made, containing only the top 32GB of journal articles with the most citations, plus all review articles.)
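That “top cited up to 32GB” subset could be picked with a simple greedy pass; a sketch only, with invented metadata fields:

    # Sketch: keep all review articles, then the most-cited papers,
    # until a 32 GB MicroSD budget is used up. Field names are invented.
    def select_for_sdcard(papers, budget_bytes=32 * 10**9):
        keep, used = [], 0
        # reviews first, then everything else in descending citation order
        ordered = sorted(papers, key=lambda p: (not p["is_review"], -p["citations"]))
        for p in ordered:
            if used + p["size_bytes"] <= budget_bytes:
                keep.append(p)
                used += p["size_bytes"]
        return keep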
Even just converting science into a Wikipedia-like format would be useful for the sake of open access. Imagine if all citations in a paper were a hyperlink away, and the abstract would display if you hovered your mouse over the link. (The XOWA app does this for Wikipedia links.)
YES! YES! YES! And this could be done pretty much automatically. Also, links in the reverse direction: “who cited this paper?” with abstracts in tooltips.
But there is much more that could be done in the hypothetical Science Wiki. For example, imagine that the reverse citations that disagree with the original paper would appear in a different color or with a different icon, so you could immediately check “who disagrees with this paper?”. That would already require some human work (unfortunately, with all the problems that follow, such as edit wars and editor corruption). Or imagine having a “Talk page” for each of these papers. Imagine people trying to write better third-party abstracts: more accessible, fewer buzzwords, adding some context from later research. Imagine people trying to write a simpler version of the more popular papers...
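The reverse-direction index, at least, really could be built automatically from the forward citations; a minimal sketch:

    # Invert a forward-citation map {paper_id: [ids it cites]} into
    # a reverse index answering "who cited this paper?".
    from collections import defaultdict

    def reverse_citations(forward):
        reverse = defaultdict(list)
        for citer, cited_list in forward.items():
            for cited in cited_list:
                reverse[cited].append(citer)
        return reverse

    # reverse_citations({"A": ["B"], "C": ["B"]})["B"]  ->  ["A", "C"]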
The science could be made more accessible and popular.
One of my first thoughts was glosses.
If I recall, in the early Middle Ages, one of the main ways by which philosophy and proto-science advanced was through the extensive use of glosses (as adapted from biblical glosses). Contemporary thinkers would all write commentaries on various works of Aristotle. At first, these were confined to the margins of the manuscripts being copied, but later they were published separately.
Since Aristotle had a fairly comprehensive philosophy, this meant reading all the glosses on a particular work of his brought you up to speed with the current state of knowledge on that topic. This had the effect of creating domains of knowledge, and scholarly specialization first nucleated around individual texts.
I say this because one of the main problems with science today is just that there is so much of it. This makes it difficult to have interdisciplinary exchange of knowledge and meaningful communication and coordination.
Having a search engine like Google Scholar helps enormously, but it can be difficult to sift through a body of knowledge if you don’t already know all the right keywords to search for. The existence of review articles also helps summarize, but it’s still a somewhat clumsy solution. Why not replace the review article with a wiki?
It would be nice to have everything arranged formally and hierarchically, by field, sub-field, and then by topic within that sub-field. Each level could have its own publicly editable summary, if there was enough human effort to maintain it. Imagine all that also ordered by citation index, and with links to all relevant news articles, blogs, and reddit threads commenting on each article.
Read a tabloid headline starting with “Scientists Say...”? Go directly to the wiki, and check what other scientists and the internet think of the research quality, background context, reputability, etc. Maybe even have a prediction market on whether the findings will replicate.
A huge part of the scientific discourse is no longer happening in the journal articles themselves, but this could capture it all in one place.
CiteSeer was originally supposed to serve a similar purpose by automatically extracting the excerpts where the paper was cited, so that a human could judge whether they were positive or negative. But it seems to have been abandoned after the advent of Google Scholar, or maybe before.
Innovation in science may undermine the efficacy of science if science is a process.
For Wikipedia, I’ve been reasonably satisfied with Kiwix for software, and their updated-every-month-or-three copies of Wikipedia, and the related Wikimedia foundation sites, at http://wiki.kiwix.org/wiki/Content_in_all_languages .
Unfortunately, I don’t have “decent cash” to shell out. I’ve seen some setups at /r/DataHoarder that I would be extremely happy to ever own, but don’t expect to until typical HDs are an order of magnitude or two bigger than today’s. By which time I expect people will have come up with brand-new forms of data to fill the things with. :)
(Also, nice username. :) )
It’s not just a nom-de-net, it’s a way of life. :)
Ah, data hoarding. This is a subject that interests me for multiple reasons. I think preserving humanity’s knowledge is important to start with. But I also like to have local copies of things in case of emergency or just a regular internet outage.
You mentioned wikipedia. I found it takes a long time to download, and viewing it is difficult.
I am working on a scraper for lesswrong. I already downloaded all the html of every post, but I need to parse it into a machine readable format, and then I will publish it as a torrent.
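(The parsing step might look roughly like this; an untested sketch, with generic field extraction rather than anything matched to LessWrong’s actual markup:)

    # Sketch: turn a directory of saved post HTML into a single JSON file.
    # Assumes the pages were saved under lesswrong_html/ and that
    # beautifulsoup4 is installed; real titles/authors/votes would need
    # selectors matched to the actual markup.
    import glob
    import json
    from bs4 import BeautifulSoup

    posts = []
    for path in glob.glob("lesswrong_html/*.html"):
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        posts.append({
            "title": soup.title.string if soup.title else "",
            "text": soup.get_text(" ", strip=True),
        })

    with open("lesswrong_posts.json", "w", encoding="utf-8") as f:
        json.dump(posts, f)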
All reddit comments ever are available. I don’t really know what the utility of this is; I’m mostly interested in this stuff for machine learning. But I have found that reddit comments are fantastic for answering questions that Wikipedia might not be able to answer, not to mention multiple lifetimes of reading material. I once had an IRC bot that would answer questions by searching askreddit, and it was fairly effective for many types of questions. Similarly it might be worth scraping other social media sites such as Hacker News.
I found a torrent for “reddit’s favorite books” which contains hundreds of books people recommended on reddit. It may be worth downloading, say, all books that have ever appeared on a best sellers list. But one would need to have such a list and figure out how to scrape libgen, which I haven’t looked into yet.
Various textbooks are available through torrent sites or Library Genesis. These contain knowledge in a format better than Wikipedia, I think. Also scientific papers.
The problem with this is that many books, and especially papers and textbooks, are distributed in weird formats like PDF or even PostScript. These formats are awful and don’t compress well.
The fantastic thing about text data is that it’s so small, compared to images or video. And it compresses super well. You can store multiple libraries’ worth of text on a cheapish hard drive.
But PDFs store tons of data as overhead. Just converting them to text might be possible, but that fails terribly on math or anything that isn’t English text, especially graphs, which I think are important. OCR has tons of errors. I’d love to someday have a local archive of all of humanity’s knowledge, with almost every book and paper ever published, but it would require solving this problem.
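The crude conversion itself is easy enough; a sketch using poppler’s pdftotext, which keeps plain English text but mangles math, tables, and figures exactly as described (paths are placeholders):

    # Bulk-convert PDFs to plain text with poppler's pdftotext.
    # Assumes pdftotext is installed and the papers live under papers/.
    import glob
    import os
    import subprocess

    for pdf in glob.glob("papers/*.pdf"):
        txt = os.path.splitext(pdf)[0] + ".txt"
        subprocess.run(["pdftotext", pdf, txt], check=False)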
Then perhaps it would be possible to store the data on nickel plates that will last up to 10,000 years. One website is doing that with all of their data, which is crazy because it’s mostly images too. There is no information on the total storage space, but they do say “Ten thousand standard letter-sized sheets of text or more could fit onto a 2.2-inch diameter nickel plate”, which seems like a lot.
Maybe there is good info in reddit comments, but how do you find it? Google? Maybe if you restrict it to askreddit it is tractable. Did your bot do its own searching?
My IRC bot used reddit’s own search API, but restricted to a handful of subreddits like eli5, askscience, and askreddit. I also used a bit of machine learning to improve the results, by predicting whether or not a post would have a good answer. It was based on just a few simple features, like the number of n-grams that matched in the title and the body, the number of votes, etc.
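The search half of such a bot is pretty simple; a rough sketch against reddit’s public search endpoint (the re-ranking model is left out, and the parameter details should be treated as approximate):

    # Sketch: query a few Q&A subreddits through reddit's JSON search endpoint.
    import json
    import urllib.parse
    import urllib.request

    def search_reddit(question, subreddits=("askreddit", "askscience", "explainlikeimfive")):
        titles = []
        for sub in subreddits:
            query = urllib.parse.urlencode(
                {"q": question, "restrict_sr": "on", "sort": "relevance"})
            url = f"https://www.reddit.com/r/{sub}/search.json?{query}"
            req = urllib.request.Request(url, headers={"User-Agent": "qa-bot-sketch"})
            with urllib.request.urlopen(req) as resp:
                data = json.load(resp)
            titles += [child["data"]["title"] for child in data["data"]["children"]]
        return titles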
It was on the #lesswrong IRC for some time and people loved to play with it, until eventually a fun-hating op muted it.
Sample conversation: https://i.imgur.com/LDD9isL.jpg
I am working on a scraper for lesswrong. I already downloaded all the html of every post, but I need to parse it into a machine readable format, and then I will publish it as a torrent.
I think that’ll be worth at least a Discussion post when you publish it, for those of us who don’t keep track of every comment. :)
(Will you be including OvercomingBias?)
I’ve found a torrent of public-domain “survival books” of which at least some may interest you; unfortunately, LW doesn’t seem to want to let me embed the magnet URL, so I’ll try just pasting it: magnet:?xt=urn:btih:57963b66246379aa3c10d84a5de92c0ab5173faf&dn=SurvivalLibrary&tr=http%3a%2f%2ftracker.tfile.me%3a80%2fannounce&tr=http%3a%2f%2fpow7.com%3a80%2fannounce&tr=http%3a%2f%2ftracker.pow7.com%2fannounce&tr=http%3a%2f%2ftorrent.gresille.org%3a80%2fannounce&tr=http%3a%2f%2fp4p.arenabg.ch%3a1337%2fannounce&tr=http%3a%2f%2fretracker.krs-ix.ru%2fannounce&tr=http%3a%2f%2fmgtracker.org%3a2710%2fannounce&tr=http%3a%2f%2ftracker.dutchtracking.nl%3a80%2fannounce&tr=http%3a%2f%2fshare.camoe.cn%3a8080%2fannounce&tr=http%3a%2f%2ftracker.dutchtracking.com%3a80%2fannounce&tr=http%3a%2f%2fexplodie.org%3a6969%2fannounce&tr=http%3a%2f%2ftorrent.gresille.org%2fannounce&tr=http%3a%2f%2fretracker.krs-ix.ru%3a80%2fannounce&tr=http%3a%2f%2ft1.pow7.com%2fannounce&tr=http%3a%2f%2fpow7.com%2fannounce&tr=http%3a%2f%2fsecure.pow7.com%2fannounce&tr=http%3a%2f%2ftracker.tfile.me%2fannounce&tr=http%3a%2f%2fatrack.pow7.com%3a80%2fannounce&tr=http%3a%2f%2fextremlymtorrents.me%2fannounce.php&tr=http%3a%2f%2finferno.demonoid.me%3a3414%2fannounce&tr=http%3a%2f%2ftorrentsmd.com%3a8080%2fannounce&tr=udp%3a%2f%2fopen.facedatabg.net%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337&tr=udp%3a%2f%2fthetracker.org%3a80&tr=udp%3a%2f%2f9.rarbg.to%3a2710&tr=udp%3a%2f%2f9.rarbg.me%3a2710%2fannounce&tr=udp%3a%2f%2f9.rarbg.to%3a2710%2fannounce&tr=udp%3a%2f%2f9.rarbg.me%3a2710&tr=udp%3a%2f%2fopen.facedatabg.net%3a6969&tr=udp%3a%2f%2ftracker.ex.ua%3a80%2fannounce&tr=udp%3a%2f%2finferno.demonoid.com%3a3411%2fannounce&tr=udp%3a%2f%2finferno.demonoid.ph%3a3389%2fannounce&tr=udp%3a%2f%2f9.rarbg.com%3a2710%2fannounce&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.coppersurfer.tk%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.ilibr.org%3a6969%2fannounce&tr=udp%3a%2f%2fzer0day.ch%3a1337%2fannounce&tr=udp%3a%2f%2fwww.eddie4.nl%3a6969%2fannounce&tr=udp%3a%2f%2ftorrent.gresille.org%3a80%2fannounce&tr=udp%3a%2f%2fp4p.arenabg.ch%3a1337%2fannounce&tr=udp%3a%2f%2fp4p.arenabg.com%3a1337%2fannounce&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969&tr=udp%3a%2f%2ftracker.kicks-ass.net%3a80%2fannounce&tr=udp%3a%2f%2ftracker.tiny-vps.com%3a6969%2fannounce&tr=udp%3a%2f%2f91.218.230.81%3a6969%2fannounce&tr=udp%3a%2f%2f168.235.67.63%3a6969%2fannounce&tr=udp%3a%2f%2fexplodie.org%3a6969%2fannounce&tr=udp%3a%2f%2feddie4.nl%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.coppersurfer.tk%3a6969&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce&tr=udp%3a%2f%2ftracker.aletorrenty.pl%3a2710%2fannounce&tr=http%3a%2f%2ftracker.dler.org%3a6969%2fannounce
Yes, if I finish it I will make a Discussion post for it. I didn’t plan on including Overcoming Bias, but that could be done.
If you put 4 spaces before it, you can make it a codeblock, which should fix it.
Data-recovery insurance is called “a backup”.
There is not much need for me to have copies of information off the ’net. The exceptions are music/movies/books, which I don’t bother to back up (I just want to have them locally and know where to find them if my local storage dies), and a variety of titbits that interest me, which fit into Evernote quite well. I don’t see any point in having a local copy of Wikipedia and/or Project Gutenberg.