This is a cool idea; thanks for creating the bounty.
I probably won’t have time to attempt this myself, but my guess is someone who simply follows the LangChain example here: https://github.com/hwchase17/chat-langchain can get pretty far, if they choose the right set of documents!
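The core of that example is just "retrieve the most relevant documents for a question, then feed them to the model." Here's a toy, stdlib-only sketch of the retrieval step under stated assumptions: TF-IDF cosine similarity stands in for LangChain's embeddings + vector store, and the document snippets are placeholders, not real posts.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; good enough for a toy retriever."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter()  # document frequency of each term
    for counts in tokenized:
        df.update(counts.keys())
    return [{t: c * math.log((1 + n) / (1 + df[t]))
             for t, c in counts.items()}
            for counts in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, docs, k=1):
    """Indices of the k documents most similar to the query."""
    vectors = tfidf_vectors(docs + [query])
    qvec = vectors[-1]
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(vectors[i], qvec),
                    reverse=True)
    return ranked[:k]
```

In the real pipeline you'd swap `top_k` for a vector store lookup and pass the retrieved passages into the LLM prompt, but the "right set of documents" point is the same: garbage in the index means garbage in the answers.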
Note, I believe the sequences themselves can be easily (and permissibly) scraped from here: https://www.readthesequences.com/
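If you do save pages (by hand or otherwise), you still need to strip the HTML before indexing. A minimal stdlib sketch, with the caveat that the tags skipped below are my assumption about typical page chrome, not the actual markup of readthesequences.com:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style and page chrome."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

For real pages you'd probably reach for BeautifulSoup or trafilatura instead, but the point is just that "save the post" means "save the text," not the surrounding navigation.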
For non-sequence posts, there’s a small obstacle: LW’s terms specifically prohibit “Using a spider, scraper, or other automated technology to access the Website;”
(https://docs.google.com/viewer?url=https%3A%2F%2Fintelligence.org%2Ffiles%2FPrivacyandTerms-Lesswrong.com.pdf)
Not sure if Arbital, AF, etc. have similar restrictions, though it might suffice to just save the most important posts and papers by hand. (In fact, that might even produce better results—probably there are a lot of bad alignment takes on LW that should be excluded from the index anyway.)
It should be possible to ask content owners for permission and get pretty far with that.