And if the resolution is that it’s a violation of copyright to train models on copyright-constrained work, it’ll probably move more of the modeling out of the US.
That’s definitely an outcome,* although if you think about it, LLMs are just a crutch. The end goal is to understand a user’s prompt and generate an output that is likely to be correct in a factual, mathematical, or the-code-runs sense. Most AI problems are still RL problems in the end.
https://www.biia.com/japan-goes-all-in-copyright-doesnt-apply-to-ai-training/
What this means is that a distilled model, trained only on the output of a model that was itself trained on everything, would lose the ability to quote anything copyrighted verbatim. All of the information scraped from every source would have been used, but never distributed.
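A minimal sketch of that distillation step, purely illustrative (the teacher/student objects and their generate/train_on methods are made-up interfaces, not any particular library’s API):

    # The "clean" student only ever sees text *generated* by the "dirty"
    # teacher; it never touches the raw scraped corpus.
    def build_distillation_set(teacher, prompts, samples_per_prompt=4):
        dataset = []
        for prompt in prompts:
            for _ in range(samples_per_prompt):
                completion = teacher.generate(prompt)  # teacher was trained on everything
                dataset.append((prompt, completion))   # only this reaches the student
        return dataset

    def train_student(student, dataset):
        for prompt, completion in dataset:
            student.train_on(prompt, completion)       # ordinary next-token loss
        return student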
And then obviously the next stage would be to RL-train the distilled model, losing even more copyrighted quirks. Doing this also sheds a lot of useless information: there’s a ton of misinformation online that gets repeated over and over, along with human quirks that show up so often the LLM will mimic them, even though they don’t improve the model’s ability to emit a correct answer.
One simple way to do this: some of the generations contain misinformation, a second model researches each claim and flags the false ones, the results are cached, and every generation the model makes in the future containing the same misinformation gets negative RL feedback.
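Roughly, the reward function could look like the sketch below (all names are hypothetical; the fact_checker and extract_claims pieces are assumed to exist, and they are the hard part):

    # Cache of claims already researched: claim text -> True (accurate) / False.
    fact_cache = {}

    def check_claim(claim, fact_checker):
        # Research each distinct claim once, then reuse the cached verdict.
        if claim not in fact_cache:
            fact_cache[claim] = fact_checker.verify(claim)
        return fact_cache[claim]

    def reward(generation, fact_checker, extract_claims):
        claims = extract_claims(generation)
        false_claims = [c for c in claims if not check_claim(c, fact_checker)]
        # Negative feedback scales with how much known misinformation is repeated.
        return -1.0 * len(false_claims)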
This “multi-stage” method, where there’s an internal “dirty” model and an external “clean” model, is how the IBM-compatible BIOS was created.
https://en.wikipedia.org/wiki/Clean_room_design
https://www.quora.com/How-did-Compaq-reverse-engineered-patented-IBM-code
Of course there are tradeoffs. In theory the above would lose the ability to, say, generate Harry Potter fanfics, but it would still be able to write a grammatically correct, cohesive story about a school for young wizards.