Why do you think that lawmakers should introduce regulations to make it an infringement to train an AI on content that’s copyrighted but publicly accessible? Should it also be illegal for people to learn from copyrighted material? If not, what are the relevant differences between humans and AIs? Are there possible future situations in which you think it would be OK for AIs to learn from copyrighted material?
Imagine, e.g., a situation where it has somehow turned out that the AIs we make are broadly comparable to humans in cognitive abilities, they “live” alongside us in something like the way you see in some science fiction where there are humans and somewhat-human-like robots, and they learn in something like the way humans do. Would you then want humans to be able to learn from any materials that have been published, while the AIs have to learn only from material that has an explicit “yes, AIs can learn from this” permission attached? You may well feel that this sort of scenario is wildly improbable, and you’d probably be right, but if you would want AIs to be able to learn from the same things as humans in that scenario but not in more probable ones, what is it about these scenarios that makes the difference?
I’d suggest looking at this from a consequentialist perspective.
One of your questions was, “Should it also be illegal for people to learn from copyrighted material?” This seems to imply that whether a policy is good for AIs depends on whether it would be good for humans. It’s almost a Kantian perspective—“What would happen if we universalized this principle?” But I don’t think that’s a good heuristic for AI policy. For just one example, I don’t think AIs should be given constitutional rights, but humans clearly should.
My other comment explains why I think the consequences of restricting training data would be positive.
I don’t say that the same policies must necessarily apply to AIs and humans. But I do say that if they don’t, there should be a reason why they treat AIs and humans differently.
Why?
If a law treats people a certain way, there must be a reason for that, because people have rights.
But if a law treats non-people a certain way, there doesn’t need to be any reason for that. All that is required is that there be good reasons for what consequences the law has for people.
There does not seem to be any reason why the default should be to treat AIs and humans the same way (or to treat AIs in any particular way).
I think “humans are people and AIs aren’t” could be a perfectly good reason for treating them differently, and didn’t intend to say otherwise. So, e.g., if Mikhail had said “Humans should be allowed to learn from anything they can read because doing so is a basic human right and it would be unjust to forbid that; today’s AIs aren’t the sort of things that have rights, so that doesn’t apply to them at all” then that would have been a perfectly cromulent answer. (With, e.g., the implication that to whatever extent that’s the whole reason for treating them differently in this case, the appropriate rules might change dramatically if and when there are AIs that we find it appropriate to think of as persons having rights.)
Humans can’t learn from any materials that NYT has published without paying NYT or otherwise getting permission, as NYT articles are usually paywalled. NYT, in my opinion, should have the right to restrict commercial use of the work they own.
The current question isn’t whether digital people are allowed to look at something and learn from it the way humans are allowed to; the current question is whether for-profit AI companies can use copyrighted human work to create arrays of numbers that represent the work process behind the copyrighted material and the material itself, by changing these numbers to increase the likelihood of specific operations on them producing the copyrighted material. These AI companies then use these extracted work processes to compete with the original possessors of these processes. [To be clear, I believe that further refinement of these numbers to make something that also successfully achieves long-term goals is likely to lead to no human or digital consciousness existing or learning or doing anything of value (even if we embrace some pretty cosmopolitan views; see https://moratorium.ai for my reasoning on this), which might bias me towards wanting regulation that prevents big labs from achieving ASI until safety is solved, especially with policies that support innovation, startups, etc., anything that has benefits without risking the existence of our civilisation.]
If in the specific case of NYT articles the articles in question aren’t intended to be publicly accessible, then this isn’t just a copyright matter. But the OP doesn’t just say “there should be regulations to make it illegal to sneak around access restrictions in order to train AIs on material you don’t have access to”, it says there should be regulations to prohibit training AIs on copyrighted material. Which is to say, on pretty much any product of human creativity. And that’s a much broader claim.
Your description at the start of the second paragraph seems kinda tendentious. What does it have to do with anything that the process involves “arrays of numbers”? In what sense do these numbers “represent the work process behind the copyrighted material”? (And in what sense if any is that truer of AI systems than of human brains that learn from the same copyrighted material? My guess is that it’s much truer of the humans.) The bit about “increase the likelihood of … producing the copyrighted material” isn’t wrong exactly, but it’s misleading and I think you must know it: it’s the likelihood of producing the next token of that material given the context of all the previous tokens, and actually reproducing the input in bulk is very much not a goal.
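(For concreteness, and assuming only the standard autoregressive language-model setup rather than anything specific to any one lab’s training code: the quantity being maximized during training is, roughly, the per-token log-likelihood

$$\mathcal{L}(\theta) = \sum_{t} \log p_\theta(x_t \mid x_{<t}),$$

where $x_t$ is the next token of the training text, $x_{<t}$ is the preceding context, and $\theta$ is the model’s parameters. That objective rewards predicting each next token well given its context; it does not directly reward reproducing any document wholesale.)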
It may well be true that all progress on AI is progress toward our doom, but it’s not obviously appropriate to go from that to “so we should pass laws that make it illegal to train AIs on copyrighted text”. That seems a bit like going from “Elon Musk’s politics are too right-wing for my taste and making him richer is bad” to “so we should ban electric vehicles” or from “the owner of this business is gay and I personally disapprove of same-sex relationships” to “so I should encourage people to boycott the business”. In each case, doing the thing may have the consequences you want, but it’s not an appropriate way to pursue those consequences.