Hm, are you sure they’re actually that protective against scrapers? I ran a quick script and was able to extract all 548 unique pages just fine: https://pastebin.com/B824Hk8J The final output was:
Status codes encountered:
200: 548
404: 20
I reran it two more times, it still worked. I’m using a regular residential IP address, no fancy proxies. Maybe you’re just missing the code to refresh the cookies (included in my script)? I’m probably missing something of course, just curious why the scraping seems to be easy enough from my machine?
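For readers who cannot open the pastebin, here is a minimal sketch of the general idea (cookie-aware fetching, one retry after a cookie refresh, and a status-code tally). It is not the script linked above: `tally_status_codes`, `make_client`, the `landing_url` parameter, the choice to retry only on 403, and the idea that revisiting a landing page reissues cookies are all assumptions made for illustration.

```python
import urllib.error
import urllib.request
from collections import Counter
from http.cookiejar import CookieJar


def tally_status_codes(urls, fetch, refresh_cookies, retry_on=(403,)):
    """Fetch each URL; on a blocked status, refresh cookies and retry once.

    `fetch(url)` must return an integer HTTP status code and
    `refresh_cookies()` must reset session state. Both are injected so the
    tallying logic can be exercised without network access.
    """
    counts = Counter()
    for url in urls:
        status = fetch(url)
        if status in retry_on:
            refresh_cookies()
            status = fetch(url)  # single retry; the final status is counted
        counts[status] += 1
    return counts


def make_client(landing_url, user_agent="Mozilla/5.0"):
    """Build (fetch, refresh_cookies) backed by a shared cookie jar.

    Assumption: visiting `landing_url` is enough for the site to issue
    fresh cookies. Real sites may require more than that.
    """
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    opener.addheaders = [("User-Agent", user_agent)]

    def fetch(url):
        try:
            with opener.open(url, timeout=30) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            # 4xx/5xx responses raise; report the code instead of failing.
            return err.code

    def refresh_cookies():
        jar.clear()         # drop stale cookies
        fetch(landing_url)  # let the site set new ones

    return fetch, refresh_cookies
```

Calling `fetch, refresh = make_client(landing_url)` and then `tally_status_codes(urls, fetch, refresh)` returns a `Counter` that can be printed in the same "Status codes encountered" shape as above. Connection-level failures (`urllib.error.URLError`) are deliberately left unhandled in this sketch.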
Ooh this is useful for me. The pastebin link appears broken—any chance you can verify it?
I definitely get 403s and captchas pretty reliably for OpenAI and OpenAI alone (and notably not Google, Meta, Anthropic, etc.) with an instance based on https://github.com/dgtlmoon/changedetection.io. Will have to look into cookie refreshing. I have had some success with randomizing IPs, but maybe I don’t have the cookies sorted.
Here’s the corrected link: https://pastebin.com/B824Hk8J
Are you running this from an EC2 instance or some other cloud provider? They might just have a blocklist of IPs belonging to data centers.
I’ve used both data center and rotating residential proxies :/ But I am running it on the cloud. Your results are promising so I’m going to see how an OpenAI-specific one run locally works for me, or else a new proxy provider.
Thanks again for looking into this.