Most of my copyright knowledge comes from the debian-legal mailing list in the 90s, about the limits of the GPL and the interaction of various “free-ish” licenses, especially around the Affero versions. The consensus was that copyright restricted distribution, i.e. giving copies to others, and did not restrict use that didn’t involve transferring data. Contrary to this, the Affero provisions were carried forward as the AGPLv3 alongside GPLv3, and they seem to have teeth, so we were wrong.
That makes it like many big legal questions about things that only recently became important: the courts have to decide. Ideally, Congress would pass clarifying legislation, but that’s not what they do anymore. There are good arguments on both sides, so my suspicion is it’ll go to the Supreme Court before it’s really resolved. And if the resolution is that it’s a copyright violation to train models on copyright-constrained work, it’ll probably push more of the modeling out of the US.
That’s definitely one possible outcome,* although if you think about it, LLMs are just a crutch. The end goal is to understand a user’s prompt and generate an output that is likely to be correct in a factual / mathematical / the-code-runs sense. Most AI problems are still RL problems in the end.
* https://www.biia.com/japan-goes-all-in-copyright-doesnt-apply-to-ai-training/
What this means is that a distilled model, trained only on the output of a model that trained on everything, would lose the ability to quote copyrighted text verbatim. All of the scraped information would have been used, but none of it would be distributed.
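Mechanically, that distillation step might look something like the sketch below (a minimal PyTorch sketch under my own assumptions: the loss is standard knowledge distillation, and the temperature and commented-out loop are illustrative, not anyone’s actual pipeline):

    # Minimal knowledge-distillation sketch (PyTorch). The "dirty" teacher
    # was trained on everything; the "clean" student only ever sees the
    # teacher's output distributions, never the scraped text itself.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions, then pull the student toward the
        # teacher with KL divergence; gradients flow only into the student.
        t = temperature
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        student_log_probs = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(student_log_probs, teacher_probs,
                        reduction="batchmean") * (t * t)

    # Illustrative training loop: prompts come from a prompt corpus,
    # not from the copyrighted training data.
    # for batch in prompt_loader:
    #     with torch.no_grad():
    #         teacher_logits = teacher(batch)
    #     loss = distillation_loss(student(batch), teacher_logits)
    #     loss.backward(); optimizer.step(); optimizer.zero_grad()

The point is where the boundary sits: the student only ever sees probability distributions, never the original text.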
Then the obvious next stage is to RL-train the distilled model, shedding even more copyrighted quirks. Doing this also discards a lot of useless information: there’s a ton of misinformation online that gets repeated over and over, along with human quirks the LLM will mimic even though they don’t improve its ability to emit a correct answer.
One simple way to do this: some generations will contain misinformation, so a second model researches each claim, flags the false ones, and caches the results; every future generation that repeats the same misinformation gets negative RL feedback.
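A toy sketch of that loop (extract_claims, research_claim, and the reward values are all invented for illustration; a real checker model and retrieval step would sit behind them):

    # Toy sketch of the misinformation-penalty loop with a verdict cache.
    verdict_cache: dict[str, bool] = {}  # claim text -> True if it checks out

    def extract_claims(text: str) -> list[str]:
        # Placeholder: a real pipeline would use a model to split the
        # generation into atomic factual claims; here, one per sentence.
        return [s.strip() for s in text.split(".") if s.strip()]

    def research_claim(claim: str) -> bool:
        # Placeholder for the expensive research step; a real checker
        # would consult sources and return a verdict.
        return True

    def claim_reward(claim: str) -> float:
        # Research each claim only once; the cached verdict is reused
        # for every future generation that repeats it.
        if claim not in verdict_cache:
            verdict_cache[claim] = research_claim(claim)
        return 0.0 if verdict_cache[claim] else -1.0

    def generation_reward(text: str) -> float:
        # Every known-false claim in a generation contributes negative
        # RL feedback.
        return sum(claim_reward(c) for c in extract_claims(text))

Plugged into any standard policy-gradient setup, generations that repeat cached misinformation get pushed down, and the research cost is paid once per unique claim.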
This “multi-stage” method, with an internal “dirty” model and an external “clean” model, is essentially how the IBM-compatible BIOS was created: one team studied the original and wrote a specification, and a separate team that never saw IBM’s code implemented from the spec.
https://en.wikipedia.org/wiki/Clean_room_design
https://www.quora.com/How-did-Compaq-reverse-engineered-patented-IBM-code
Of course there are tradeoffs. In theory the above would lose the ability to, say, generate Harry Potter fanfic, but it would still be able to write a grammatically correct, cohesive story about a school for young wizards.