However, you are not allowed to just blindly reproduce large chunks of what you read. That would be both plagiarism (morally) and, unless some form of fair use applies, breach of copyright (legally). Many simplistic for-a-lay-person explanations of how AI works imply that this is what AIs do, and people who put little credence in AI capabilities increasing often assume this is both all they can do and all they will ever be able to do.
Also, on rare occasions, for certain prompts and certain documents that actually got memorized during training (for example, because the training set contained a great many copies of much the same document), AIs really do exactly this: they reproduce significant chunks of a copyrighted document verbatim or only very slightly paraphrased. We don’t know how to ensure that this never happens, other than building a plagiarism detector to catch it after the fact and then refusing to send the offending response to the end user.
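For concreteness, here is a minimal sketch (in Python, purely illustrative, not any vendor’s actual method) of what such an output-side guard could look like: check what fraction of the response’s word n-grams appear verbatim in an index of the training corpus, and refuse to pass the response along if the overlap is too high. The in-memory set and the `build_index` helper are hypothetical stand-ins; a real system would need a scalable structure such as a suffix array or Bloom filter over the corpus.

```python
# Sketch of an output-side memorization guard: flag a response when too many
# of its word n-grams also occur verbatim in the training corpus.

def ngrams(words: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-word windows of the text."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_memorized(response: str,
                    corpus_ngrams: set[tuple[str, ...]],
                    n: int = 8,
                    threshold: float = 0.5) -> bool:
    """True if more than `threshold` of the response's n-grams
    appear verbatim in the (hypothetical) corpus index."""
    grams = ngrams(response.lower().split(), n)
    if not grams:
        return False
    overlap = sum(1 for g in grams if g in corpus_ngrams)
    return overlap / len(grams) > threshold

# Hypothetical usage: gate every response before it reaches the user.
# corpus_ngrams = build_index(training_documents)   # hypothetical helper
# if looks_memorized(model_output, corpus_ngrams):
#     model_output = "Sorry, I can't reproduce that text verbatim."
```

The 8-gram window and 50% threshold are arbitrary illustrative values; where to set them is exactly the hard part, since a loose filter misses lightly paraphrased copying and a strict one blocks legitimate quotation.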
Of course, a typical sentence in a typical AI response is influenced primarily by hundreds or thousands (or more) of documents across the training set, as researchers found when they performed the very computationally expensive analysis needed to determine this: the output is based on learning and combining patterns, not on memorizing specific text passages. However, in a court of law, telling the judge “our device usually doesn’t break the law, but sometimes it does, especially if goaded to, and we don’t know how to make it stop” isn’t exactly a strong position.
Also, the one clear legal precedent we do have in the area of copyright is that anything created by an AI does not have the same status for claiming copyright as an original work created by a human. That makes it not entirely clear whether the argument “this AI behavior would be legal if a human did it” applies here. Anything not forbidden is legal, but is the “being influenced by many sources” aspect of fair use a boundary to the law of copyright, or an exception, and if it’s an exception, does it apply to something that isn’t human? A question which the legislators, of course, never even considered, forcing judges to make it up as they go along. (Of course, in the US, corporations now have free speech rights; I’m sure an ingenious lawyer could work that into an argument for a closed-source, API-only AI owned by a corporation…)
Yes, you’re right on all counts. I’m just wondering whether anyone thinks there is actually a coherent underlying justification for this kind of standard, other than “because people who never actually thought about it said so.”
Also:
However, in a court of law, telling the judge “our device usually doesn’t break the law, but sometimes it does, especially if goaded to, and we don’t know how to make it stop” isn’t exactly a strong position.
This is true, and yet it is also the position anyone making any kind of dangerous product is in. Cars and planes and knives and various chemicals can be easily goaded to break the law by the user. No one has yet released a car that only ever follows all applicable laws no matter what the driver does.
As you pointed out, we don’t consider AIs to have minds and thoughts and rights under current law, which would seem to make them products under human control for such purposes. The producer is liable for making things work as described. The user is responsible for using them in a way that is legal and doesn’t harm others. I don’t understand the argument for the producer being on the hook for the user finding a way to use it to duplicate copyrighted material.
As I understand it (#NotALawyer), the law makes a distinction between selling a toolkit that has many legal uses and can also help you steal cars, and selling that same toolkit with advertising about how good it is for stealing cars and helpful instructions on how to do so. Some of the AI image-generation models shipped with single joined_by_underscores keywords for the names of artists (who hadn’t consented to being included) that reproduce their styles, along with instructions on how to use them. Combined with the wrong rest of the prompt, that would sometimes even reproduce a near-copy of a single artwork by that artist from the training set. We’ll see how that court case goes. (My understanding is that a style is not considered copyrightable, but a specific image, or a sufficient number of elements from it, is.)
Sooner or later, we’ll have robots that are physically and mentally capable of stealing a car all by themselves, if that would help them fulfill an otherwise-legal instruction from their owner. The law is going to hold someone responsible for ensuring that the robots don’t do that: some combination of the manufacturer and the owner/end-user, according to which seems more reasonable to the judge and jury.
Cars and planes and knives and various chemicals can be easily goaded to break the law by the user. No one has yet released a car that only ever follows all applicable laws no matter what the driver does.
Without taking a position on the copyright problem as a whole, there’s an important distinction here around how straightforward the user’s control is. A typical knife is operated in a way where deliberate, illegal knife-related actions can reasonably be seen as a direct extension of the user’s intent (and accidental ones an extension of the user’s negligence). A traditional car is more complex, but cars are also subject to licensing regimes which establish social proof that the user has been trained in how to produce intended results when operating the car, so that illegal car-related actions can be similarly seen as an extension of the user’s intent or negligence. Comparing this to the legal wrangling around cars with ‘smarter’ autonomous driving features may be informative, because that’s when it gets more ambiguous how much of the result is a direct translation of the user’s intent. There does seem to be a lot of legal and social pressure on manufacturers to ensure the safety of autonomous driving by technical means, but I’m not as sure about legality; in particular, I vaguely remember mixed claims around the way self-driving features handle the tension between posted speed limits and commonplace human driving behavior in the US.
In the case of a chatbot, the part where the bot makes use of a vast quantity of information that the user isn’t directly aware of as part of forming its responses is necessary for its purpose, so expecting a reasonable user to take responsibility for anticipating and preventing any resulting copyright violations is not practical. Here, comparing chatbot output to that of search engines—a step down in the tool’s level of autonomy, rather than a step up as in the previous car comparison—may be informative. The purpose of a search engine similarly relies on the user not being able to directly anticipate the results, but the results can point to material that contains copyright violations or other content that is illegal to distribute. And even though those results are primarily links instead of direct inclusions, there’s legal and social pressure on search engines to do filtering and enforce specific visibility takedowns on demand.
So there’s clearly some kind of spectrum here between user responsibility and vendor responsibility that depends on how ‘twisty’ the product is to operate.