We frequently talk about AI capability gains being bad because they shorten the timeframe for AI safety research. By that logic, taking steps to slow AI capability gains would be worthwhile.
At the moment, large language models are trained on huge amounts of data that the companies training them do not license. If licensing that data were required, the data available for training language models would shrink severely, and their capabilities with it.
It’s expensive to fight lawsuits in the United States. Currently, there are artists who feel their rights are violated by DALL-E 2 using their art as training data. Much as Thiel funded the Gawker lawsuits, it would be possible to back artists in a suit against OpenAI seeking to require OpenAI to license the images used to train DALL-E 2. If such a lawsuit is well funded, it becomes much more likely that a precedent requiring data licensing gets set, which would slow down AI development.
I’m curious what people who think more about AI safety than I do make of such a move. Would it be helpful?
I’m not sure exactly where I land on this, but I think it’s important to consider that restricting the data companies can train on could influence the architectures they use. Self-supervised autoregressive models à la GPT-3 seem a lot more benign than full-fledged RL agents, and the latter are far less data-hungry than the former (especially in terms of copyrighted data). There are enough other factors here that I’m not completely confident in this analysis, but it’s worth thinking about.
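To make that contrast concrete, here is a toy sketch (entirely hypothetical: random tokens stand in for scraped text, and the "environment" is a trivial guessing game) of where the training data comes from in each paradigm:

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

# Paradigm 1: self-supervised autoregressive training. Every gradient
# step consumes externally gathered text, which is where scraped (and
# potentially copyrighted) corpora enter the picture.
vocab, seq_len, batch = 100, 16, 8
lm = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
tokens = torch.randint(0, vocab, (batch, seq_len))  # stand-in for scraped text
logits = lm(tokens[:, :-1])                         # predict the next token
lm_loss = F.cross_entropy(logits.reshape(-1, vocab),
                          tokens[:, 1:].reshape(-1))

# Paradigm 2: reinforcement learning. The agent manufactures its own
# training data by acting in an environment; no external corpus is
# consumed by the update.
class ToyEnv:
    def reset(self):
        self.target = random.randint(0, 1)
    def step(self, action):           # reward for guessing a hidden bit
        return float(action == self.target)

env, prefs = ToyEnv(), [0.0, 0.0]     # tabular action preferences
for _ in range(1000):
    env.reset()
    action = (random.randint(0, 1) if random.random() < 0.1   # explore
              else max((0, 1), key=lambda a: prefs[a]))        # exploit
    reward = env.step(action)
    prefs[action] += 0.1 * (reward - prefs[action])  # incremental update
```

The structural point is that the first loss cannot even be computed without an external corpus, while the second loop never touches one.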
I’m leaning toward the current paradigm being preferable to a full-fledged RL one, but I want to add a point: one of my best guesses for proto-AGI involves massive LLMs hooked up to some RL system. This might not require RL capabilities at the same level of complexity as pure RL agents, and RL is still being actively worked on today.
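To gesture at what I mean by that hookup, here is a minimal, purely illustrative sketch; the tiny model and the reward function are made-up stand-ins, not any real system. The LLM proposes tokens as actions, and an outer RL loop scores them and nudges the model toward higher-reward outputs:

```python
import torch
import torch.nn as nn

vocab, hidden = 50, 32
llm = nn.Sequential(nn.Embedding(vocab, hidden),   # stand-in "LLM"
                    nn.Linear(hidden, vocab))
opt = torch.optim.Adam(llm.parameters(), lr=1e-3)

def reward_fn(token_id: int) -> float:
    # placeholder for the RL component: any scalar judgment of the output
    return token_id / vocab

for _ in range(200):
    prompt = torch.tensor([1])                     # fixed dummy prompt
    dist = torch.distributions.Categorical(logits=llm(prompt))
    action = dist.sample()                         # the "action" is a token
    reward = reward_fn(action.item())
    loss = -(dist.log_prob(action) * reward).mean()  # REINFORCE-style update
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Even in this toy form the division of labor is visible: the LLM supplies the action space and the prior, and the RL side only has to supply a scalar signal, which is why it needn’t be as sophisticated as a pure RL agent.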
Agreed, but LLM + RL is still preferable to MuZero-style AGI.
I agree, but this is a question of timelines too. Within the LLM + RL paradigm, we may not need AGI-level RL, or LLMs that can simulate AGI-level simulacra from self-supervised learning alone; both of those would take longer to reach than the many intermediate points that require only moderate levels of LLM and RL capability, especially since people are still working on RL now.
I’m not a lawyer and this is not legal advice, but I think the current US legal framework isn’t going to work to challenge training on publicly available data.
One argument that a use is fair use is that it is transformative [1]. Taking an image or text and using it to slightly influence a giant matrix of numbers, in such a way that the original is not recoverable and new kinds of expression become possible, seems likely to count as transformative.
So if you think that restricting access to public data for training purposes is a promising approach [2], you should probably focus on trying to create a new regulatory framework.
Having said that, this is all US analysis. Other countries have other frameworks and may not have exact analogs of fair use. Perhaps in the EU legal challenges are more viable.
[1] https://www.nolo.com/legal-encyclopedia/fair-use-what-transformative.html
[2] You should also consider the side effects. For instance, this would advantage giant companies that can pay to license or create data, as well as jurisdictions with low respect for the law. Whether that’s desirable is worth thinking through.
From the overview, it seems like there’s one Supreme Court case and the other cases are from lower courts.
Transformativeness is only one of the factors in the Supreme Court case.
The Supreme Court case does say:
If you were to sue over DALL-E 2, you would likely argue that it does create market harm by creating competition.
Creating competition doesn’t count as harm—it has to be direct substitution for the work in question. That’s a pretty high bar.
Also, there are things like Stable Diffusion that arguably aren’t commercial (the code and model are free), which further undercuts the commercial-purpose angle.
I’m not saying any of this is dispositive—that’s the nature of balancing tests. I think this is going to be a tough row to hoe though, and certainly not a slam dunk to say that copyright should prevent ML training on publicly available data.
(Still not a lawyer, still not legal advice!)
Let’s say an artist draws images in a highly distinctive style. Afterwards, a lot of DALL-E 2 images get created in the same style. That would make the style less distinctive and less valuable.
It’s plausible financial harm in a way that doesn’t exist in the case on which the Supreme Court ruled.
I don’t believe it’s a slam dunk either. I do believe there’s room for the Supreme Court to decide either way. The fact that it’s not a slam dunk either way suggests that spending money on making a stronger legal argument is valuable.
You can’t copyright a style.
It’s already happening: https://githubcopilotinvestigation.com/ (which I learned about yesterday from the is-github-copilot-in-legal-trouble post).
I think it would be an interesting plot twist: humanity saved from AI FOOM by the big IT companies having to obey the intellectual property rights they themselves defended for so many years :)
If you want to ban or monopolize such models, push for that directly. Indirectly banning them is evil.
They’re already illegal. GPT-3 is based in large part on what appear to be pirated books. (I wonder if Google’s models are covered by its settlements with publishers.)
Which is argued to be “transformative” and thus not illegal.
Even if it’s transformative, they should still have to buy a license for each book they use.
Why?
The basic idea of copyright is that if you want to acquire a copy of a book, you need to buy that copy. If they just downloaded from lib-gen, they didn’t buy the copies of the books they used, and that would be a copyright violation.
That’s true whether or not you afterward do something transformative.
What a bizarre normative assertion. That copyright violation would be true whether or not they used it to train a model or indeed, deleted it immediately after downloading it. The copyright violation is one thing, and the model is another thing. The license that one would buy has nothing to do with any transformative ML use, and would deny that use if possible (and likely already contains language to the effect of denying as much as possible). There is no more connection than there is in the claim “if you rob a Starbucks, you should buy a pastry first”.
Yes, the copyright violation is true whether or not they used it to train a model. Douglas_Knight’s claim is that the copyright violation occurred. If that’s true, that makes it possible to sue them over it.
No, OpenAI is not arguing this. They are not arguing anything, but just hiding their sources. Maybe they’re arguing this about using the public web as training data, but that doesn’t cover pirated books.
Yes, a model is transformative, not infringement. But the question was about the training data: is that infringement? Distributing the Pile is a tort, and probably a crime by quantity. Acquiring the training data was a tort and probably a crime. I’m not sure about possessing it. Even if OpenAI is shielded from criminal responsibility, a crime was necessary for the model’s creation, and that was not enough to deter it.
OpenAI is in fact arguing this, and wrote one of the primary position papers arguing the transformative-use position.
Does this link say anything about their illegal acquisition of the sources?
It sure looks to me like you and they are lying to distract. I condemn this lying, just as I condemned Christian’s proposed lies.
How would you do that? How would you write the laws?
Yes, but only because I want to make sure that anyone who runs a copy of my soul gets a license for it from me first. It’s not going to slow down capabilities, except by differentially moving capabilities to other groups.
Too late. Much too late. The bird has flown the coop; no use shutting the barn door after the cows have run away. Also, since the majority of the training data is just ‘the internet’, you can’t get rid of this except by monitoring everyone’s use of the internet to make sure they aren’t making web scrapes for the purpose of later training an AI on that data. If all you do is cause some bureaucratic hassle for US companies, you give an edge to non-US groups and probably also encourage US groups to go elsewhere.