The result, then, is districts like Mason City asking ChatGPT, “Does [insert book here] contain a description or depiction of a sex act?” If the answer was yes, the book was removed from the district’s libraries and stored.
Regarding China or other regimes using LLMs for censorship, I’m actually concerned that it might rapidly go in the opposite direction from what is speculated here:
It has widely been reported that the PRC may be hesitant to deploy public-facing LLMs due to concerns that the models themselves can’t be adequately censored—it might be very difficult to make a version of ChatGPT that cannot be tricked into saying “4/6/89.”
In principle it should be possible to completely delete certain facts from the training set of an LLM. A static text dataset is easier to audit than the ever-changing content of the internet. If the government requires companies building LLMs to vet their training datasets—or perhaps even requires everyone to contribute the data they want to include into a centralized approved repository—perhaps it could exert more control over what facts are available to the population.
It’s essentially impossible to block all undesired web content with the Great Firewall of China, since so much new content is constantly being created; instead, as I understand it, they take a more probabilistic approach to detection and deterrence. But that limitation doesn’t necessarily apply to LLMs. I could see a world where Google-like search UIs are significantly displaced by each individual having a conversation with a government-approved LLM, which would give the government much more power to control what information is available to be discovered.
A possible limiting factor is that you can’t get up-to-date news from an LLM, since it only knows about what’s in its training data. But there are knowledge-retrieval architectures that can get around that limitation, at least to some degree. So the question is whether the CCP could build an LLM that’s good enough that people wouldn’t revolt if the internet were blocked and replaced by it (of course, this would occur gradually).
I think these are great points. It’s entirely possible that a really good, appropriately censored LLM becomes a big part of China’s public-facing internet.
On the article about Iowa schools: I looked into this a little while writing the post. As far as I could see, they are not running GPT over the full text of each book and asking about its content, which is what I was approximating. Instead, they are literally just prompting it with “Does [book X] contain a sex scene?” and taking the first completion as the truth. This seems to me like a poor way of determining whether books contain objectionable content, but it is evidence that bureaucratic organs are happy to outsource decisions to opaque knowledge-producers like LLMs, whether or not those tools are effective.
Amusingly, the US seems to have already taken this approach to censor books: https://www.wired.com/story/chatgpt-ban-books-iowa-schools-sf-496/