if there is no official lesswrong db/site archive of public posts, i’d like to create my own with automated tools like wget, so that i can browse the site offline. see also: "Is there a lesswrong archive of all public posts?"
both wget and curl get a 403 Forbidden. logs:
$ wget -mk https://www.lesswrong.com/
--2023-11-08 14:31:26-- https://www.lesswrong.com/
Loaded CA certificate ‘/etc/ssl/certs/ca-certificates.crt’
Resolving www.lesswrong.com (www.lesswrong.com)… 54.90.19.223, 44.213.228.21, 54.81.2.129
Connecting to www.lesswrong.com (www.lesswrong.com)|54.90.19.223|:443… connected.
HTTP request sent, awaiting response… 403 Forbidden
2023-11-08 14:31:26 ERROR 403: Forbidden.
Converted links in 0 files in 0 seconds.

$ curl -Lv https://www.lesswrong.com/
* Trying 54.81.2.129:443…
* Connected to www.lesswrong.com (54.81.2.129) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: none
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
* subject: CN=lesswrong.com
* start date: Sep 8 00:00:00 2023 GMT
* expire date: Oct 6 23:59:59 2024 GMT
* subjectAltName: host "www.lesswrong.com" matched cert's "www.lesswrong.com"
* issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M02
* SSL certificate verify ok.
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://www.lesswrong.com/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: www.lesswrong.com]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> GET / HTTP/2
> Host: www.lesswrong.com
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/2 403
< server: awselb/2.0
< date: Wed, 08 Nov 2023 19:31:44 GMT
< content-type: text/html
< content-length: 118
<
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>
* Connection #0 to host www.lesswrong.com left intact
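one thing the curl trace does show: the 403 arrives with `server: awselb/2.0`, which suggests (but does not prove) that the response was generated at the load balancer rather than by the site's application code. below is a minimal sketch for pulling the status code and `server` header out of a saved `curl -v` trace, so repeated attempts are easier to compare. `summarize_trace` and `trace.log` are hypothetical names introduced here for illustration; they are not part of curl or of the logs above.

```shell
#!/bin/sh
# Hypothetical helper (not part of curl): summarize a saved `curl -v` trace.
# In curl's verbose output, lines beginning with "< " are response headers.
#
# Assumed usage (curl writes the verbose trace to stderr):
#   curl -sv -o /dev/null https://www.lesswrong.com/ 2> trace.log
#   summarize_trace trace.log
summarize_trace() {
  trace_file=$1
  # First response status line, e.g. "< HTTP/2 403" gives "403".
  status=$(tr -d '\r' < "$trace_file" |
    awk '/^< HTTP\// { print $3; exit }')
  # Server response header, e.g. "< server: awselb/2.0" gives "awselb/2.0".
  server=$(tr -d '\r' < "$trace_file" |
    awk '$1 == "<" && tolower($2) == "server:" { print $3; exit }')
  printf 'status=%s server=%s\n' "$status" "$server"
}
```

the `tr -d '\r'` step is there because header lines in a saved trace may end in a carriage return, which would otherwise end up in the extracted values.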
[Question] Why is lesswrong blocking wget and curl (scrape)?