Which the old version certainly would have done. The central thing the bill intends to do is require effective watermarking for all AIs capable of fooling humans into thinking their output is ‘real’ content, and labeling of all content everywhere.
OpenAI is known to have been sitting on a 99.9% effective (by their own measure) watermarking system for a year. They chose not to deploy it, because it would hurt their business – people want to turn in essays and write emails, and would rather the other person not know that ChatGPT wrote them.
As far as we know, no other company has similar technology. It makes sense that OpenAI would want watermarking mandated everywhere.
Is watermarking actually really difficult? The overall concept seems straightforward, the most obvious ways to do it don’t require any fiddling with model internals (so you don’t need AI expertise, or expensive human work for your specific system like RLHF), and Scott Aaronson claims that a single OpenAI engineer was able to build a prototype pretty quickly.
I imagine that if this becomes law, some academics could probably hack together an open source solution quickly. So I’m skeptical that the regulatory capture angle is particularly strong.
(I might be too optimistic about the engineering difficulties and amount of schlep needed, of course).
If the academics can hack together an open source solution, why haven’t they? Seems like it would be a highly cited, very popular paper. What’s the theory on why they don’t do it?
Just spitballing, but it doesn’t seem theoretically interesting to academics unless they’re bringing something novel (algorithmically or in design) to the table, and it’s not practically useful unless implemented widely, since it’s trivial for e.g. college students to use the least watermarked model.
No one would use it if not forced to?
Two responses.
One, even if no one used it, there would still be value in demonstrating it was possible. If academia only develops things people will adopt commercially right away, then we might as well dissolve academia. This is a highly interesting and potentially important problem; people should be excited.
Two, there would presumably at minimum be demand to give students (for example) access to a watermarked LLM, so they could benefit from it without being able to cheat. That’s even an academic motivation. And if the major labs won’t do it, someone can build a Llama version or whatnot for this, no?
Yeah, I think the simplest thing for image generation is for model hosting providers to use a separate tool—and lots of work on that already exists. (see, e.g., this, or this, or this, for different flavors.) And this is explicitly allowed by the bill.
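To make the ‘separate tool’ idea concrete, here is a minimal sketch of a post-hoc image tag, assuming a hosting provider stamps outputs after generation. It is a toy least-significant-bit scheme written with Pillow, not one of the robust approaches in the work linked above (it does not survive JPEG re-encoding or cropping), and the payload and function names are only illustrative, but it shows that this step never needs to touch the generating model.

```python
from PIL import Image

TAG = b"ai-generated"  # hypothetical payload a hosting provider might embed

def embed_tag(path_in, path_out, payload=TAG):
    """Write the payload bits into the least-significant bit of the red channel.
    Toy scheme: survives lossless PNG round-trips, not JPEG re-encoding or crops."""
    img = Image.open(path_in).convert("RGB")
    pixels = list(img.getdata())
    bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
    if len(bits) > len(pixels):
        raise ValueError("image too small for payload")
    stamped = [
        ((r & ~1) | bits[idx], g, b) if idx < len(bits) else (r, g, b)
        for idx, (r, g, b) in enumerate(pixels)
    ]
    out = Image.new("RGB", img.size)
    out.putdata(stamped)
    out.save(path_out, "PNG")

def read_tag(path, length=len(TAG)):
    """Recover `length` bytes from the red-channel LSBs."""
    pixels = list(Image.open(path).convert("RGB").getdata())
    return bytes(
        sum((pixels[i * 8 + j][0] & 1) << j for j in range(8))
        for i in range(length)
    )
```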
For text, it’s harder to do well, and you only get weak probabilistic identification, but it’s also easy to implement an Aaronson-like scheme, even if doing it really well is harder. (I say easy because I’m pretty sure I could do it myself, given, say, a month working with one of the LLM providers, and I’m wildly underqualified to do software dev like this.)
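For the text case, here is a minimal sketch of the kind of Aaronson-style scheme being described, assuming you can hook the sampler and see per-token probabilities (the key, context length, and scoring details are placeholders I have made up). A keyed hash of the recent context gives every vocabulary item a pseudorandom score, sampling picks the token maximizing r^(1/p), which still draws from the model’s distribution, and detection re-derives the scores to check whether the chosen tokens score suspiciously high. This is also why no model internals are needed beyond the output probabilities.

```python
import hashlib
import hmac
import math
import random

SECRET_KEY = b"replace-with-a-real-secret"  # placeholder key known only to the provider
CONTEXT_LEN = 4                             # how many preceding tokens seed the scores

def pseudorandom_scores(context_tokens, vocab_size):
    """Deterministic r in (0,1) for every vocabulary item, keyed by the secret
    and the last CONTEXT_LEN tokens (a real system would use a per-token PRF)."""
    seed = hmac.new(SECRET_KEY,
                    " ".join(map(str, context_tokens[-CONTEXT_LEN:])).encode(),
                    hashlib.sha256).digest()
    rng = random.Random(seed)
    return [rng.random() for _ in range(vocab_size)]

def watermarked_sample(probs, context_tokens):
    """Pick argmax_i r_i ** (1 / p_i) (the Gumbel/exponential-minimum trick).
    Marginally this still samples from probs, but the choice is now a
    deterministic function of (key, context), which detection exploits."""
    r = pseudorandom_scores(context_tokens, len(probs))
    best, best_score = 0, -1.0
    for i, p in enumerate(probs):
        if p <= 0:
            continue
        score = r[i] ** (1.0 / p)
        if score > best_score:
            best, best_score = i, score
    return best

def detection_score(token_ids, vocab_size):
    """Average of -ln(1 - r) over chosen tokens: about 1 for ordinary text,
    noticeably higher when the text was sampled with watermarked_sample."""
    total = 0.0
    for t in range(CONTEXT_LEN, len(token_ids)):
        r = pseudorandom_scores(token_ids[:t], vocab_size)[token_ids[t]]
        total += -math.log(max(1e-12, 1.0 - r))
    return total / max(1, len(token_ids) - CONTEXT_LEN)
```

The detection statistic averages about 1 on unwatermarked text and drifts higher on watermarked text, so longer passages give more confident calls, which is exactly the ‘weak probabilistic identification’ caveat above; paraphrasing or heavy editing washes the signal out.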