Humans can’t learn from most material the NYT has published without paying the NYT or otherwise getting permission, as NYT articles are usually paywalled. The NYT, in my opinion, should have the right to restrict commercial use of the work it owns.
The current question isn’t whether digital people are allowed to look at something and learn from it the way humans are; it’s whether for-profit AI companies can use copyrighted human work to create arrays of numbers that represent both the work process behind the copyrighted material and the material itself, by adjusting those numbers to increase the likelihood that specific operations on them produce the copyrighted material. These companies then use the extracted work processes to compete with the original possessors of those processes. [To be clear, I believe that further refinement of these numbers into something that also successfully achieves long-term goals is likely to leave no human or digital consciousness existing, learning, or doing anything of value (even on fairly cosmopolitan views; see https://moratorium.ai for my reasoning), which might bias me towards regulation that prevents big labs from achieving ASI until safety is solved, especially alongside policies that support innovation, startups, etc. — anything that has benefits without risking the existence of our civilisation.]
If in the specific case of NYT articles the articles in question aren’t intended to be publicly accessible, then this isn’t just a copyright matter. But the OP doesn’t just say “there should be regulations to make it illegal to sneak around access restrictions in order to train AIs on material you don’t have access to”, it says there should be regulations to prohibit training AIs on copyrighted material. Which is to say, on pretty much any product of human creativity. And that’s a much broader claim.
Your description at the start of the second paragraph seems kinda tendentious. What does it have to do with anything that the process involves “arrays of numbers”? In what sense do these numbers “represent the work process behind the copyrighted material”? (And in what sense, if any, is that truer of AI systems than of human brains that learn from the same copyrighted material? My guess is that it’s much truer of the humans.) The bit about “increase the likelihood of … producing the copyrighted material” isn’t exactly wrong, but it’s misleading, and I think you must know it: what’s increased is the likelihood of producing the next token of that material given the context of all the previous tokens, and actually reproducing the input in bulk is very much not a goal.
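To make the next-token point concrete, here’s a toy sketch (not any real lab’s code — a counting-based bigram model standing in for a neural network): the quantity that training pushes up is the conditional probability of each next token given its context, averaged over the corpus, not the probability of emitting the whole document.

```python
import math
from collections import defaultdict

# Toy "training corpus" (hypothetical example text).
corpus = "the cat sat on the mat".split()

# "Training": adjust numbers (here, simple counts) so that each observed
# next token becomes more likely given the token that preceded it.
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev, nxt):
    # Conditional probability of the next token given the previous one.
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

# The training objective is the average negative log-likelihood of each
# NEXT token given its context -- a per-token, conditional quantity.
pairs = list(zip(corpus, corpus[1:]))
nll = -sum(math.log(p_next(p, n)) for p, n in pairs) / len(pairs)
print(round(nll, 3))  # after "the", both "cat" and "mat" were seen, so loss > 0
```

Even on this memorised six-word corpus, the loss stays above zero wherever the context is ambiguous (here, after “the”), which is one way to see that the objective rewards good conditional prediction rather than bulk reproduction.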
It may well be true that all progress on AI is progress toward our doom, but it’s not obviously appropriate to go from that to “so we should pass laws that make it illegal to train AIs on copyrighted text”. That seems a bit like going from “Elon Musk’s politics are too right-wing for my taste and making him richer is bad” to “so we should ban electric vehicles” or from “the owner of this business is gay and I personally disapprove of same-sex relationships” to “so I should encourage people to boycott the business”. In each case, doing the thing may have the consequences you want, but it’s not an appropriate way to pursue those consequences.