Ah, data hoarding. This is a subject that interests me for multiple reasons. To start with, I think preserving humanity’s knowledge is important. But I also like to have local copies of things in case of an emergency or just a regular internet outage.
You mentioned Wikipedia. I found that it takes a long time to download, and that viewing it offline is difficult.
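For machine use, at least, the dump can be streamed without unpacking it to disk. A minimal sketch in Python (the filename and the export-namespace version are just examples; the namespace varies with the dump):

    # Stream pages out of a Wikipedia pages-articles dump without
    # decompressing it to disk first.
    import bz2
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # varies by dump version

    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                print(title, len(text))
                elem.clear()  # drop the finished page to keep memory flat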
I am working on a scraper for LessWrong. I have already downloaded the HTML of every post, but I need to parse it into a machine-readable format, and then I will publish it as a torrent.
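For anyone curious, the parsing step is mostly just walking the HTML. A rough sketch of the idea; the selectors here are placeholders, since the real class names have to be read off the saved pages:

    # Parse saved post HTML into machine-readable records.
    # NOTE: the selectors below are placeholders, not LessWrong's real
    # markup; take the actual class names from the saved pages.
    import json
    from pathlib import Path

    from bs4 import BeautifulSoup

    records = []
    for path in Path("html_dump").glob("*.html"):  # example directory
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        title = soup.select_one("h1")            # placeholder selector
        body = soup.select_one("div.post-body")  # placeholder selector
        if title and body:
            records.append({"title": title.get_text(strip=True),
                            "body": body.get_text("\n", strip=True)})

    with open("posts.json", "w", encoding="utf-8") as out:
        json.dump(records, out, ensure_ascii=False, indent=2)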
Every reddit comment ever made is available for download. I don’t really know what the utility of this is; I’m mostly interested in this stuff for machine learning. But I have found that reddit comments are fantastic for answering questions that Wikipedia might not be able to answer, not to mention that they contain multiple lifetimes of reading material. I once had an IRC bot that would answer questions by searching AskReddit, and it was fairly effective for many types of questions. Similarly, it might be worth scraping other social media sites such as Hacker News.
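The dump files themselves are easy to work with: one JSON object per line inside a compressed archive. Something like this, assuming one of the older bzip2-compressed monthly files:

    # Scan a monthly reddit comment dump: one JSON object per line.
    # Assumes an older bzip2-compressed file; newer dumps use other
    # compressors.
    import bz2
    import json

    with bz2.open("RC_2015-01.bz2", "rt", encoding="utf-8") as f:
        for line in f:
            comment = json.loads(line)
            if comment["subreddit"] == "AskReddit" and comment["score"] > 100:
                print(comment["body"][:200])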
I found a torrent called “reddit’s favorite books” which contains hundreds of books people recommended on reddit. It may be worth downloading, say, every book that has ever appeared on a best-sellers list. But one would need to find such a list and figure out how to scrape libgen, which I haven’t looked into yet.
Various textbooks are available through torrent sites or Library Genesis. These present knowledge in a better format than Wikipedia does, I think. The same goes for scientific papers.
The problem is that many books, and especially papers and textbooks, are distributed in awkward formats like PDF or even PostScript. These formats are awful and don’t compress well.
The fantastic thing about text data is how small it is compared to images or video, and it compresses extremely well. You can store multiple libraries’ worth of text on a cheap hard drive.
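This is easy to check on any plain-text file:

    # Quick check of how well plain text compresses (xz/LZMA).
    # The filename is just an example; use any plain-text file.
    import lzma

    with open("war_and_peace.txt", "rb") as f:
        data = f.read()
    compressed = lzma.compress(data, preset=9)
    print(f"{len(data):,} bytes -> {len(compressed):,} bytes "
          f"({len(compressed) / len(data):.1%} of original)")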
But PDFs carry tons of overhead. Converting them to plain text might be possible, but that fails terribly on math or anything else that isn’t English prose, especially graphs, which I think are important. OCR introduces tons of errors as well. I’d love to someday have a local archive of all of humanity’s knowledge, with almost every book and paper ever published, but that would require solving this problem.
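Extracting the text is the easy part; a few lines with pdfminer will do it, for example. The output quality on math and figures is where it falls apart:

    # Extract plain text from a PDF with pdfminer.six.
    # Straight English prose comes out passably; equations, tables,
    # and figures are exactly where it fails.
    from pdfminer.high_level import extract_text

    text = extract_text("some_paper.pdf")  # example filename
    print(text[:500])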
Then perhaps it would be possible to store the data on nickel plates that can last up to 10,000 years. One website is doing that with all of their data, which is crazy because it’s mostly images too. There is no information on the total storage capacity, but they do say “Ten thousand standard letter-sized sheets of text or more could fit onto a 2.2-inch diameter nickel plate”, which seems like a lot.
Maybe there is good info in reddit comments, but how do you find it? Google? Maybe if you restrict it to AskReddit it’s tractable. Did your bot do its own searching?
My IRC bot used reddit’s own search API, restricted to a handful of subreddits like ELI5, AskScience, and AskReddit. I also used a bit of machine learning to improve the results, predicting whether or not a post would have a good answer. The prediction was based on just a few simple features: the number of n-grams matching in the title and in the body, the number of votes, etc.
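The search side looked roughly like this (a reconstruction of the idea from memory, not the original code; the endpoint is reddit’s real JSON search, but the scoring weights here are made up):

    # Search reddit for a question, restricted to one subreddit, and rank
    # results with a crude hand-rolled score. A reconstruction of the
    # idea, not the original bot; the weights are arbitrary.
    import json
    import urllib.parse
    import urllib.request

    def search(question, subreddit="AskReddit"):
        query = urllib.parse.urlencode({"q": question, "restrict_sr": 1})
        url = f"https://www.reddit.com/r/{subreddit}/search.json?{query}"
        req = urllib.request.Request(url, headers={"User-Agent": "qa-bot/0.1"})
        with urllib.request.urlopen(req) as resp:
            posts = [c["data"] for c in json.load(resp)["data"]["children"]]

        q_words = set(question.lower().split())

        def score(post):
            # word overlap with the title plus a vote bonus; weights made up
            overlap = len(q_words & set(post["title"].lower().split()))
            return 2.0 * overlap + 0.01 * post.get("score", 0)

        return sorted(posts, key=score, reverse=True)

    best = search("why is the sky blue")[0]
    print(best["title"], best["permalink"])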
It ran on the #lesswrong IRC channel for some time, and people loved to play with it, until a fun-hating op eventually muted it. Sample conversation: https://i.imgur.com/LDD9isL.jpg
> I am working on a scraper for LessWrong. I have already downloaded the HTML of every post, but I need to parse it into a machine-readable format, and then I will publish it as a torrent.
I think that’ll be worth at least a Discussion post when you publish it, for those of us who don’t keep track of every comment. :)
(Will you be including OvercomingBias?)
> But I also like to have local copies of things in case of an emergency or just a regular internet outage.
I’ve found a torrent of public-domain “survival books” of which at least some may interest you; unfortunately, LW doesn’t seem to want to let me embed the magnet URL, so I’ll try just pasting it:
magnet:?xt=urn:btih:57963b66246379aa3c10d84a5de92c0ab5173faf&dn=SurvivalLibrary&tr=http%3a%2f%2ftracker.tfile.me%3a80%2fannounce&tr=http%3a%2f%2fpow7.com%3a80%2fannounce&tr=http%3a%2f%2ftracker.pow7.com%2fannounce&tr=http%3a%2f%2ftorrent.gresille.org%3a80%2fannounce&tr=http%3a%2f%2fp4p.arenabg.ch%3a1337%2fannounce&tr=http%3a%2f%2fretracker.krs-ix.ru%2fannounce&tr=http%3a%2f%2fmgtracker.org%3a2710%2fannounce&tr=http%3a%2f%2ftracker.dutchtracking.nl%3a80%2fannounce&tr=http%3a%2f%2fshare.camoe.cn%3a8080%2fannounce&tr=http%3a%2f%2ftracker.dutchtracking.com%3a80%2fannounce&tr=http%3a%2f%2fexplodie.org%3a6969%2fannounce&tr=http%3a%2f%2ftorrent.gresille.org%2fannounce&tr=http%3a%2f%2fretracker.krs-ix.ru%3a80%2fannounce&tr=http%3a%2f%2ft1.pow7.com%2fannounce&tr=http%3a%2f%2fpow7.com%2fannounce&tr=http%3a%2f%2fsecure.pow7.com%2fannounce&tr=http%3a%2f%2ftracker.tfile.me%2fannounce&tr=http%3a%2f%2fatrack.pow7.com%3a80%2fannounce&tr=http%3a%2f%2fextremlymtorrents.me%2fannounce.php&tr=http%3a%2f%2finferno.demonoid.me%3a3414%2fannounce&tr=http%3a%2f%2ftorrentsmd.com%3a8080%2fannounce&tr=udp%3a%2f%2fopen.facedatabg.net%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337&tr=udp%3a%2f%2fthetracker.org%3a80&tr=udp%3a%2f%2f9.rarbg.to%3a2710&tr=udp%3a%2f%2f9.rarbg.me%3a2710%2fannounce&tr=udp%3a%2f%2f9.rarbg.to%3a2710%2fannounce&tr=udp%3a%2f%2f9.rarbg.me%3a2710&tr=udp%3a%2f%2fopen.facedatabg.net%3a6969&tr=udp%3a%2f%2ftracker.ex.ua%3a80%2fannounce&tr=udp%3a%2f%2finferno.demonoid.com%3a3411%2fannounce&tr=udp%3a%2f%2finferno.demonoid.ph%3a3389%2fannounce&tr=udp%3a%2f%2f9.rarbg.com%3a2710%2fannounce&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.coppersurfer.tk%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.ilibr.org%3a6969%2fannounce&tr=udp%3a%2f%2fzer0day.ch%3a1337%2fannounce&tr=udp%3a%2f%2fwww.eddie4.nl%3a6969%2fannounce&tr=udp%3a%2f%2ftorrent.gresille.org%3a80%2fannounce&tr=udp%3a%2f%2fp4p.arenabg.ch%3a1337%2fannounce&tr=udp%3a%2f%2fp4p.arenabg.com%3a1337%2fannounce&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969&tr=udp%3a%2f%2ftracker.kicks-ass.net%3a80%2fannounce&tr=udp%3a%2f%2ftracker.tiny-vps.com%3a6969%2fannounce&tr=udp%3a%2f%2f91.218.230.81%3a6969%2fannounce&tr=udp%3a%2f%2f168.235.67.63%3a6969%2fannounce&tr=udp%3a%2f%2fexplodie.org%3a6969%2fannounce&tr=udp%3a%2f%2feddie4.nl%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.coppersurfer.tk%3a6969&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce&tr=udp%3a%2f%2ftracker.aletorrenty.pl%3a2710%2fannounce&tr=http%3a%2f%2ftracker.dler.org%3a6969%2fannounce
Yes, if I finish it I will make a Discussion post for it. I didn’t plan on including Overcoming Bias, but that could be done.
If you put 4 spaces before it, you can make it a code block, which should fix it:
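    magnet:?xt=urn:btih:57963b66246379aa3c10d84a5de92c0ab5173faf&dn=SurvivalLibrary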