My quick take is that these voluntary commitments are not that great, and I'll explain why below.
First, the scope is quite bad. Here's the scope in question:
Scope: Where commitments mention particular models, they apply only to generative models that are overall more powerful than the current industry frontier (e.g. models that are overall more powerful than any currently released models, including GPT-4, Claude 2, PaLM 2, Titan and, in the case of image generation, DALL-E 2).
I consider this a bad scope because it burdens the much safer class of models with regulation while leaving RL, especially deep RL, mostly unregulated, and I worry this sets a dangerous precedent. The main reason is something important I learned from reading porby's posts: it is actually good that, by and large, generative models/simulators/predictors have come first. At the very least, they are far easier to handle than RL from a non-misuse extinction-risk perspective, since their densely informative training gives them many nice safety features, like a lack of instrumental goals, and makes them much easier alignment targets in general.
I really hope this gets fixed soon, because I fear it will place adversarial pressure on safe AI by default, especially in combination with condition 8 below.
Develop and deploy frontier AI systems to help address society’s greatest challenges
Companies making this commitment agree to support research and development of frontier AI systems that can help meet society’s greatest challenges, such as climate change mitigation and adaptation, early cancer detection and prevention, and combating cyber threats. Companies also commit to supporting initiatives that foster the education and training of students and workers to prosper from the benefits of AI, and to helping citizens understand the nature, capabilities, limitations, and impact of the technology.
This seems like it's throwing a bone to the e/acc faction inside these companies, at least in part: it lets them justify arbitrary capabilities research so long as they can tie it to this commitment. It also seems like an indication that regulation will probably not ban AI progress, or perhaps even slow AI down all that much, because of this clause:
Companies intend these voluntary commitments to remain in effect until regulations covering substantially the same issues come into force.
Condition 5 is impossible, and how they deal with that impossibility will plausibly determine how AI goes.
Here's the condition:
Trust
5) Develop and deploy mechanisms that enable users to understand if audio or visual content is AI-generated, including robust provenance, watermarking, or both, for AI-generated audio or visual content
Companies making this commitment recognize that it is important for people to be able to understand when audio or visual content is AI-generated. To further this goal, they agree to develop robust mechanisms, including provenance and/or watermarking systems for audio or visual content created by any of their publicly available systems within scope introduced after the watermarking system is developed. They will also develop tools or APIs to determine if a particular piece of content was created with their system. Audiovisual content that is readily distinguishable from reality or that is designed to be readily recognizable as generated by a company’s AI system—such as the default voices of AI assistants—is outside the scope of this commitment. The watermark or provenance data should include an identifier of the service or model that created the content, but it need not include any identifying user information. More generally, companies making this commitment pledge to work with industry peers and standards-setting bodies as appropriate towards developing a technical framework to help users distinguish audio or visual content generated by users from audio or visual content generated by AI.
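For concreteness, here is a minimal toy sketch of one well-known style of text watermarking (statistical "green list" watermarks, in the spirit of Kirchenbauer et al.). This is my own illustration, not anything the companies have specified; real schemes bias the model's sampling at generation time rather than post-processing finished text like this toy does, and `SECRET_KEY`, `is_green`, and `detect` are all hypothetical names.

```python
import hashlib
import math

# Hypothetical shared secret held by the model provider (illustrative only).
SECRET_KEY = "example-secret"

def is_green(prev_token: str, token: str) -> bool:
    """Pseudorandomly assign roughly half of all tokens to a 'green list',
    keyed on the preceding token and the provider's secret."""
    digest = hashlib.sha256(f"{SECRET_KEY}|{prev_token}|{token}".encode()).digest()
    return digest[0] % 2 == 0

def detect(tokens: list[str]) -> float:
    """Return a z-score for how far the observed fraction of green tokens
    deviates from the ~50% expected in text nobody watermarked."""
    n = len(tokens) - 1
    if n < 1:
        return 0.0
    greens = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (greens - 0.5 * n) / math.sqrt(0.25 * n)

# A watermarking generator would bias its sampling toward green tokens, so
# watermarked text gets a large z-score while ordinary text scores near zero.
print(detect("this sentence was not generated with any green list bias".split()))
```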
The reason I believe condition 5 is impossible is this study, which importantly gives various impossibility results:
https://arxiv.org/pdf/2303.11156.pdf
So the AI companies may have committed themselves to an impossible task. The question is, what's going to happen next?
OpenAI post below:
https://openai.com/blog/moving-ai-governance-forward
About the impossibility result, if I understand correctly, that paper says two things (I’m simplifying and eliding a great deal):
1) You can take a recognizable, possibly watermarked output of one LLM, use a different LLM to paraphrase it, and not be able to detect the second LLM's output as coming from (transforming) the first LLM.
2) In the limit, any classifier that tries to detect LLM output can be beaten by an LLM that is sufficiently good at generating human-like output. There's evidence that LLMs can soon become that good. And since emulating human output is an LLM's main job, capabilities researchers and model developers will make them that good.
The second point is true but not directly relevant: OpenAI et al. are committing not to make models whose output is indistinguishable from human output.
The first point is true, BUT the companies have not committed themselves to defeating it. Their own models' output will be clearly watermarked, and they will provide reliable tools to identify those watermarks. If someone else then provides a model that is good enough at paraphrasing to remove that watermark, that is that someone else's fault, and they are effectively not abiding by this industry agreement.
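To make the paraphrase attack in point 1 concrete, here is a hedged sketch of what such an attack loop might look like. `paraphrase_with_other_model` is a hypothetical placeholder for a second, unwatermarked LLM (not a real API), and `detect` refers to the toy detector sketched above; the notable feature is that the attacker never needs the provider's secret key.

```python
def paraphrase_with_other_model(text: str) -> str:
    # Placeholder: stands in for a second LLM that rewrites `text` while
    # preserving its meaning but changing the token-level choices that the
    # watermark depends on. Not a real API call.
    raise NotImplementedError("stand-in for an unrelated paraphrasing model")

def strip_watermark(watermarked_text: str, threshold: float = 4.0) -> str:
    """Paraphrase repeatedly until the provider's detector no longer fires.
    Uses only the public-facing detection tool, never the secret key."""
    text = watermarked_text
    while detect(text.split()) > threshold:
        text = paraphrase_with_other_model(text)
    return text
```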
If open source / widely available non-API-gated models become good enough at this to render the watermarks useless, then the commitment scheme will have failed. This is not surprising; if ungated models become good enough at anything contravening this scheme, it will have failed.
There are tacit but very necessary assumptions in this approach, and it will fail if any of them break:
The ungated models released so far (e.g. LLaMA) don't contain forbidden capabilities, including output and/or paraphrasing that is indistinguishable from a human's (and also, of course, anything capable of killing everyone), and won't be improved to include them by 'open source' tinkering that doesn't come from large industry players.
No one worldwide will release new, more capable models, or sell ungated access to them, in defiance of this industry agreement; and if they do, the agreement will be enforced against them (somehow).
The inevitable use by some governments, militaries, etc. of more capable models that would be illegal to release publicly will not result in the public release of those capabilities; and their inevitable use of e.g. indistinguishable-from-human output will not cause such (public) problems that this commitment not to let private actors do the same becomes meaningless.
A more recent paper shows that an equally strong model is not needed to break watermarks through paraphrasing. It suffices to have a quality oracle and a model that achieves equal quality with positive probability.
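As I understand it, that attack is roughly a random walk over equal-quality outputs. Here is a hedged sketch under that reading, where `quality_oracle` and `perturb` are hypothetical stand-ins for the two ingredients described above: a quality judge, and a weaker model that proposes local edits matching the original quality with positive probability.

```python
def quality_oracle(original: str, candidate: str) -> bool:
    # Placeholder: stands in for a model that judges whether `candidate`
    # is at least as good a response as `original`.
    raise NotImplementedError("stand-in for a quality-judging model")

def perturb(text: str) -> str:
    # Placeholder: stands in for a weaker model that proposes a small local
    # edit (e.g. rewriting one span), matching the original quality with
    # positive probability.
    raise NotImplementedError("stand-in for a local-edit model")

def random_walk_attack(watermarked: str, steps: int = 1000) -> str:
    """Wander through the space of equal-quality texts, keeping only
    quality-preserving edits; enough accepted steps should wash out any
    correlation with the provider's watermarking key."""
    current = watermarked
    for _ in range(steps):
        candidate = perturb(current)
        if quality_oracle(watermarked, candidate):
            current = candidate
    return current
```

If that construction holds up, the watermarking part of condition 5 looks even harder to satisfy than the paraphrasing result alone suggests.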