A (somewhat minor) example of hypocrisy from OpenAI that I find frustrating.
For context: I run an automated system that checks for quiet/unannounced updates to AI companies’ public web content including safety policies, model documentation, acceptable use policies, etc. I also share some findings from this on Twitter.
Part of why I think this is useful is that OpenAI in particular has repeatedly made web changes of this nature without announcing or acknowledging them (e.g. 1, 2, 3, 4, 5, 6). I’m worried that they may continue to make substantive changes to other documents, e.g. their preparedness framework, while hoping it won’t attract attention (even just a few words, like if they one day change a “we will...” to a “we will attempt to...”).
This process requires very minimal bandwidth/requests to the web server (it checks anywhere from once a day to once a month per monitored page).
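To illustrate the kind of check described here, the following is a minimal sketch of spotting a small wording change between two snapshots of a page. It is not the actual monitoring system, whose code is not shown in this thread. The function names `fingerprint` and `changed_lines` and the sample policy text are hypothetical; only Python's standard `hashlib` and `difflib` modules are used.

```python
import difflib
import hashlib


def fingerprint(text: str) -> str:
    """Hash a page's text after collapsing whitespace, so that
    formatting-only differences do not register as changes."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_lines(old: str, new: str) -> list[str]:
    """Return only the removed ("- ") and added ("+ ") lines between snapshots."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical snapshots of a policy page, taken on two different days.
old_snapshot = "We will publish results.\nWe value safety."
new_snapshot = "We will attempt to publish results.\nWe value safety."

if fingerprint(old_snapshot) != fingerprint(new_snapshot):
    for line in changed_lines(old_snapshot, new_snapshot):
        print(line)
```

Comparing fingerprints first keeps the common no-change case cheap; the line diff only runs when something actually differs.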
But letting this system run on OpenAI’s website is complicated as (1) they are incredibly proactive at captcha-walling suspected crawlers (better than any other website I’ve encountered, and I’ve run this on thousands of sites in the past) and (2) their terms of use technically forbid any automated data collection from their website (although it’s unclear whether this is legal/enforceable in the US).
The irony should be immediately obvious: not only is their whole data collection pipeline reliant on web scraping, but they’ve previously gotten in hot water for ignoring other websites’ robots.txt and not complying with the GDPR rules on web scraping. Plus, I’m virtually certain they don’t respect other websites’ terms-of-use clauses that forbid automated access. So what makes them so antsy about automated access to their own site?
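For readers unfamiliar with robots.txt, here is a small sketch of how a well-behaved crawler can consult it before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the bot names, and the example.com URLs are all made up for illustration and do not reflect any real site's rules.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("ExampleBot", "https://example.com/private/page"))
print(allowed("ExampleBot", "https://example.com/public"))
```

Note that robots.txt is a voluntary convention: nothing technically stops a crawler from skipping this check, which is why compliance is a matter of policy rather than enforcement.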
I wish OpenAI would change one of these behaviors: either stop making quiet, unannounced, and substantive changes to their publicly-released content, or else stop trying so hard to keep automated website monitors from accessing their site to watch for these changes.
They do have a good reason to be wary of scrapers, as they provide a free version of ChatGPT. I’m guessing they just went ahead and configured their bot protection over their entire domain name rather than restricting it to the chat subdomain.
ChatGPT is only accessible for free via chatgpt.com, right? Seems like it shouldn’t be too hard to restrict it to that.
They could, but if you’re managing your firewall it’s easier to apply a blanket rule than to divide things by subdomain, unless you have a good reason to do otherwise. I wouldn’t assume malicious intent.
Sorry, I might be missing something: subdomains are subdomain.domain.com, whereas ChatGPT.com is a separate registered domain, right? In either case, I’m sure there are benefits to doing things consistently: both may be on the same server, subject to the same attacks, beholden to the same internal infosec policies, etc.
So I do believe they have their own private reasons for it. Didn’t mean to imply that they’ve maliciously done this to prevent some random internet guy’s change tracking or anything. But I do wish they would walk it back on the openai.com pages, or at least in their terms of use. It’s hypocritical, in my opinion, that they are so cautious about automated access to their own site while relying on such access so completely from other sites. Feels similar to when they tried to press copyright claims against the ChatGPT subreddit. Sure, it’s in their interest for potentially nontrivial reasons, but it also highlights how weird and self-serving the current paradigm (and their justifications for it) is.
Hm, are you sure they’re actually that protective against scrapers? I ran a quick script and was able to extract all 548 unique pages just fine: https://pastebin.com/B824Hk8J The final output was:
Status codes encountered:
200: 548
404: 20
I reran it two more times, and it still worked. I’m using a regular residential IP address, no fancy proxies. Maybe you’re just missing the code to refresh the cookies (included in my script)? I’m probably missing something, of course; I’m just curious why the scraping seems to be easy enough from my machine.
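The pastebin script is not reproduced in this thread, so the following is only a guess at what "refreshing the cookies" could look like, not the commenter's actual code. It is a minimal standard-library sketch that keeps cookies across requests and, on a 403, discards them and retries once with a clean jar. The function names, the User-Agent string, and the retry policy are all assumptions.

```python
import http.cookiejar
import urllib.error
import urllib.request


def build_opener_with_cookies():
    """Create an opener that stores and resends cookies, like a browser session."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Hypothetical User-Agent string; identify your own monitor honestly.
    opener.addheaders = [("User-Agent", "change-monitor-sketch/0.1")]
    return opener, jar


def fetch_status(opener, jar, url, retries=1):
    """Fetch a URL and return its HTTP status code.

    On a 403, drop all stored cookies and retry (assumed meaning of
    "refreshing" them); any other HTTP error code is returned as-is.
    """
    for attempt in range(retries + 1):
        try:
            with opener.open(url, timeout=30) as response:
                return response.status
        except urllib.error.HTTPError as err:
            if err.code == 403 and attempt < retries:
                jar.clear()  # discard possibly stale cookies before retrying
                continue
            return err.code
```

A call such as `fetch_status(opener, jar, "https://example.com/")` would then be repeated per monitored page, reusing the same opener so cookies persist between requests.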
Ooh this is useful for me. The pastebin link appears broken—any chance you can verify it?
I definitely get 403s and captchas pretty reliably for OpenAI and OpenAI alone (and notably not Google, Meta, Anthropic, etc.) with an instance based on https://github.com/dgtlmoon/changedetection.io. Will have to look into cookie refreshing. I have had some success with randomizing IPs, but maybe I don’t have the cookies sorted.
Here’s the corrected link: https://pastebin.com/B824Hk8J
Are you running this from an EC2 instance or some other cloud provider? They might just have a blocklist of IPs belonging to data centers.
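As a rough illustration of how such a blocklist could work (nothing here is known about OpenAI's actual configuration), a server can test each client address against a list of CIDR ranges. The ranges below are reserved documentation ranges used purely as placeholders, and `is_datacenter_ip` is a hypothetical name.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), standing in for
# real cloud-provider address blocks.
DATACENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter_ip(address: str, ranges=DATACENTER_RANGES) -> bool:
    """Return True if the address falls inside any listed network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in ranges)


print(is_datacenter_ip("192.0.2.17"))
print(is_datacenter_ip("203.0.113.5"))
```

A rule like this would explain why the same script succeeds from a residential connection but gets 403s from a cloud instance, though rotating residential proxies being blocked suggests other signals may be involved too.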
I’ve used both data center and rotating residential proxies :/ But I am running it on the cloud. Your results are promising, so I’m going to see how an OpenAI-specific instance run locally works for me, or else try a new proxy provider.
Thanks again for looking into this.