They do have a good reason to be wary of scrapers, as they provide a free version of ChatGPT. I’m guessing they just went ahead and configured the protection over their entire domain rather than restricting it to the chat subdomain.
ChatGPT is only accessible for free via chatgpt.com, right? Seems like it shouldn’t be too hard to restrict it to that.
They could, but if you’re managing your firewall it’s easier to apply a blanket rule than to divide things by subdomain, unless you have a good reason to do otherwise. I wouldn’t assume malicious intent.
Sorry, I might be missing something: subdomains are subdomain.domain.com, whereas chatgpt.com is its own separate domain, right? In either case, I’m sure there are benefits to doing things consistently: both may be on the same server, subject to the same attacks, beholden to the same internal infosec policies, etc.
So I do believe they have their own private reasons for it. Didn’t mean to imply that they’ve maliciously done this to prevent some random internet guy’s change tracking or anything. But I do wish they would walk it back on the openai.com pages, or at least in their terms of use. It’s hypocritical, in my opinion, that they are so cautious about automated access to their own site while relying on such access so completely from other sites. Feels similar to when they tried to press copyright claims against the ChatGPT subreddit. Sure, it’s in their interest for potentially nontrivial reasons, but it also highlights how weird and self-serving the current paradigm (and their justifications for it) is.
Hm, are you sure they’re actually that protective against scrapers? I ran a quick script and was able to extract all 548 unique pages just fine: https://pastebin.com/B824Hk8J The final output was:
Status codes encountered:
200: 548
404: 20
I reran it two more times and it still worked. I’m using a regular residential IP address, no fancy proxies. Maybe you’re just missing the code to refresh the cookies (included in my script)? I’m probably missing something of course, I’m just curious why the scraping seems to be easy enough from my machine.
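For anyone following along without the pastebin, here is a rough sketch of the general idea described above: fetch each page, start over with fresh cookies when a 403 comes back, and tally the status codes at the end. This is not the script from the link. The function names (`make_opener`, `fetch_status`, `crawl`, `report`) are made up for illustration, and the assumption that a 403 is caused by stale cookies may simply not hold for a given site.

```python
from collections import Counter
from http.cookiejar import CookieJar
import urllib.error
import urllib.request


def make_opener():
    """Build a URL opener backed by a fresh, empty cookie jar."""
    jar = CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def fetch_status(opener, url, timeout=10):
    """Return the HTTP status code for url, including error codes such as 403 or 404."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still what we want to count.
        return err.code


def crawl(urls, fetch=fetch_status, new_opener=make_opener):
    """Fetch each URL, retrying a 403 once with fresh cookies, and count final statuses."""
    opener = new_opener()
    counts = Counter()
    for url in urls:
        status = fetch(opener, url)
        if status == 403:
            # Assumption: the block was triggered by stale cookies, so start clean.
            opener = new_opener()
            status = fetch(opener, url)
        counts[status] += 1
    return counts


def report(counts):
    """Format the tally in the same shape as the output quoted above."""
    lines = ["Status codes encountered:"]
    lines += [f"{code}: {n}" for code, n in sorted(counts.items())]
    return "\n".join(lines)
```

Calling `print(report(crawl(list_of_urls)))` would print the tally. The `fetch` and `new_opener` parameters are only there so the retry logic can be tried out with stand-ins before pointing it at a real site.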
Ooh this is useful for me. The pastebin link appears broken—any chance you can verify it?
I definitely get 403s and captchas pretty reliably for OpenAI and OpenAI alone (and notably not Google, Meta, Anthropic, etc.) with an instance based on https://github.com/dgtlmoon/changedetection.io. Will have to look into cookie refreshing. I have had some success with randomizing IPs, but maybe I don’t have the cookies sorted.
Here’s the corrected link: https://pastebin.com/B824Hk8J
Are you running this from an EC2 instance or some other cloud provider? They might just have a blocklist of IPs belonging to data centers.
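To make the blocklist idea concrete, a check like that could be as simple as testing whether the client address falls inside a list of known CIDR ranges. The sketch below is only an illustration of the concept, not a claim about how OpenAI or any firewall vendor does it. The ranges are reserved documentation addresses standing in for real data center ranges, and `is_datacenter_ip` is a hypothetical name.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), standing in for a real
# list of data center CIDR blocks, which a site would have to source itself.
DATACENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter_ip(ip, ranges=DATACENTER_RANGES):
    """Return True if ip falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)
```

With these placeholder ranges, `is_datacenter_ip("192.0.2.10")` is true and `is_datacenter_ip("203.0.113.5")` is false. If something like this is in play, a request from a cloud instance would be flagged no matter how the cookies are handled.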
I’ve used both data center and rotating residential proxies :/ But I am running it on the cloud. Your results are promising so I’m going to see how an OpenAI-specific one run locally works for me, or else a new proxy provider.
Thanks again for looking into this.