We frequently talk about AI capability gains being bad because they shorten the timeframe for AI safety research. By that logic, taking steps to slow AI capability gains would be worthwhile.
At the moment, large language models are trained on huge amounts of data that the companies training them do not license. If licensing that data were required, the data available for training language models would shrink severely, and their capabilities with it.
It’s expensive to fight lawsuits in the United States. Currently, there are artists who feel their rights are violated by DALL-E 2 using their art as training data. Much as Thiel funded the Gawker lawsuits, it would be possible to back artists in a suit against OpenAI seeking to require OpenAI to license the images used to train DALL-E 2. If such a lawsuit is well funded, it becomes much more likely that a precedent requiring data licensing gets set, which would slow down AI development.
I’m curious what people who think more about AI safety than I do make of such a move. Would it be helpful?
I’m not sure exactly where I land on this, but I think it’s important to consider that restricting the data companies can train on could influence the architectures they use. Self-supervised autoregressive models à la GPT-3 seem a lot more benign than full-fledged RL agents, and the latter are far less data-hungry than the former (especially in terms of copyrighted data). There are enough other factors here that I’m not completely confident in this analysis, but it’s worth thinking about.
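To make that contrast concrete, here is a toy sketch (entirely hypothetical: random tokens stand in for scraped text, and the "environment" is a trivial guessing game) of where the training data comes from in each paradigm:

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

# Paradigm 1: self-supervised autoregressive training. Every gradient
# step consumes externally gathered text, which is where scraped (and
# potentially copyrighted) corpora enter the picture.
vocab, seq_len, batch = 100, 16, 8
lm = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
tokens = torch.randint(0, vocab, (batch, seq_len))  # stand-in for scraped text
logits = lm(tokens[:, :-1])                         # predict the next token
lm_loss = F.cross_entropy(logits.reshape(-1, vocab),
                          tokens[:, 1:].reshape(-1))

# Paradigm 2: reinforcement learning. The agent manufactures its own
# training data by acting in an environment; no external corpus is
# consumed by the update.
class ToyEnv:
    def reset(self):
        self.target = random.randint(0, 1)
    def step(self, action):           # reward for guessing a hidden bit
        return float(action == self.target)

env, prefs = ToyEnv(), [0.0, 0.0]     # tabular action preferences
for _ in range(1000):
    env.reset()
    action = (random.randint(0, 1) if random.random() < 0.1   # explore
              else max((0, 1), key=lambda a: prefs[a]))        # exploit
    reward = env.step(action)
    prefs[action] += 0.1 * (reward - prefs[action])  # incremental update
```

The structural point is that the first loss cannot even be computed without an external corpus, while the second loop never touches one.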
I’m leaning toward the current paradigm being preferable to a full-fledged RL one, but I want to add a point: one of my best guesses for proto-AGI involves massive LLMs hooked up to some RL system. This might not require RL capabilities at the same level of complexity as pure RL agents, and RL is still being actively worked on today.
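To gesture at what I mean by that hookup, here is a minimal, purely illustrative sketch; the tiny model and the reward function are made-up stand-ins, not any real system. The LLM proposes tokens as actions, and an outer RL loop scores them and nudges the model toward higher-reward outputs:

```python
import torch
import torch.nn as nn

vocab, hidden = 50, 32
llm = nn.Sequential(nn.Embedding(vocab, hidden),   # stand-in "LLM"
                    nn.Linear(hidden, vocab))
opt = torch.optim.Adam(llm.parameters(), lr=1e-3)

def reward_fn(token_id: int) -> float:
    # placeholder for the RL component: any scalar judgment of the output
    return token_id / vocab

for _ in range(200):
    prompt = torch.tensor([1])                     # fixed dummy prompt
    dist = torch.distributions.Categorical(logits=llm(prompt))
    action = dist.sample()                         # the "action" is a token
    reward = reward_fn(action.item())
    loss = -(dist.log_prob(action) * reward).mean()  # REINFORCE-style update
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Even in this toy form the division of labor is visible: the LLM supplies the action space and the prior, and the RL side only has to supply a scalar signal, which is why it needn’t be as sophisticated as a pure RL agent.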
Agreed, but LLM + RL is still preferable to MuZero-style AGI.
I agree, but this is a question of timelines too. Within the LLM + RL paradigm, we may not need AGI-level RL, or LLMs that can simulate AGI-level simulacra from self-supervised learning alone; both of those would take longer to reach than the many intermediate points that require only moderate levels of LLM and RL capability, especially since people are still working on RL now.
I’m not a lawyer and this is not legal advice, but I think the current US legal framework isn’t going to work to challenge training on publicly available data.
One argument that a use is fair use is that it is transformative [1]. Taking an image or text and using it to slightly influence a giant matrix of numbers, in such a way that the original is not recoverable and new kinds of expression become possible, seems likely to count as transformative.
So if you think that restricting access to public data for training purposes is a promising approach [2], you should probably focus on trying to create a new regulatory framework.
Having said that, this is all US analysis. Other countries have other frameworks and may not have exact analogs of fair use. Perhaps in the EU legal challenges are more viable.
[1] https://www.nolo.com/legal-encyclopedia/fair-use-what-transformative.html
[2] You should also consider the side effects. For instance, this would advantage giant companies that can pay to license or create data, as well as jurisdictions with low respect for the law. Whether that’s desirable is worth thinking through.
From the overview, it seems like there’s one Supreme Court case and the other cases are from lower courts.
Transformativeness is only one of the factors in the Supreme Court case.
The Supreme Court case does say:
If you were to sue over DALL-E 2, you would likely argue that it does create market harm by creating competition.
Creating competition doesn’t count as harm—it has to be direct substitution for the work in question. That’s a pretty high bar.
Also, there are things like Stable Diffusion that arguably aren’t commercial (the code and model are free), which further undercuts the commercial-purpose angle.
I’m not saying any of this is dispositive—that’s the nature of balancing tests. I think this is going to be a tough row to hoe though, and certainly not a slam dunk to say that copyright should prevent ML training on publicly available data.
(Still not a lawyer, still not legal advice!)
Let’s say an artist draws images in a highly distinctive style. Afterwards, a lot of DALL-E 2 images get created in the same style. That would make the style less distinctive and less valuable.
It’s plausible financial harm in a way that doesn’t exist in the case on which the Supreme Court ruled.
I don’t believe it’s a slam dunk either. I do believe there’s room for the Supreme Court to decide either way. The fact that it’s not a slam dunk either way suggests that spending money on making a stronger legal argument is valuable.
You can’t copyright a style.
It’s already happening: https://githubcopilotinvestigation.com/ (which I learned about yesterday from the is-github-copilot-in-legal-trouble post).
I think it would be an interesting plot twist: humanity saved from AI FOOM by the big IT companies having to obey the intellectual property rights they themselves defended for so many years :)
If you want to ban or monopolize such models, push for that directly. Indirectly banning them is evil.
They’re already illegal. GPT-3 is based in large part on what appear to be pirated books. (I wonder if Google’s models are covered by its settlements with publishers.)
Which is argued to be “transformative” and thus not illegal.
Even if it’s transformative, they should still have to buy a license for each book they use.
Why?
The basic idea of copyright is that if you want to acquire a copy of a book, you need to buy that copy. If they just downloaded from lib-gen, they didn’t buy the copies of the books they used, and that would be a copyright violation.
That’s true whether or not you afterward do something transformative.
What a bizarre normative assertion. That copyright violation would be true whether or not they used it to train a model or indeed, deleted it immediately after downloading it. The copyright violation is one thing, and the model is another thing. The license that one would buy has nothing to do with any transformative ML use, and would deny that use if possible (and likely already contains language to the effect of denying as much as possible). There is no more connection than there is in the claim “if you rob a Starbucks, you should buy a pastry first”.
Yes, the copyright violation is true whether or not they used it to train a model. Douglas_Knight’s claim is that the copyright violation occurred. If that’s true, that makes it possible to sue them over it.
No, OpenAI is not arguing this. They are not arguing anything, but just hiding their sources. Maybe they’re arguing this about using the public web as training data, but that doesn’t cover pirated books.
Yes, a model is transformative, not infringement. But the question was about the training data: is that infringement? Distributing the Pile is a tort, and probably a crime by quantity. Acquiring the training data was a tort and probably a crime. I’m not sure about possessing it. Even if OpenAI is shielded from criminal responsibility, a crime was necessary for the model’s creation, and that was not enough to deter it.
OpenAI is in fact arguing this, and wrote one of the primary position papers arguing the transformative-use position.
Does this link say anything about their illegal acquisition of the sources?
It sure looks to me like you and they are lying to distract. I condemn this lying, just as I condemned Christian’s proposed lies.
How would you do that? How would you write the laws?
Yes, but only because I want to make sure that anyone who runs a copy of my soul gets a license for it from me first. It’s not going to slow down capabilities, except by differentially moving capabilities to other groups.
Too late. Much too late. The bird has flown the coop; no use shutting the barn door after the cows have run away. Also, since the majority of the training data is just ‘the internet’, you can’t get rid of this except by monitoring everyone’s use of the internet to make sure they aren’t making web scrapes for the purpose of later training an AI on that data. If all you do is cause some bureaucratic hassle for US companies, you give an edge to non-US groups and probably also encourage US groups to go elsewhere.