I’m confused: Wouldn’t we prefer to keep such findings private? (at least, keep them until OpenAI will say something like “this model is reliable/safe”?)
My guess: You’d reply that finding good talent is worth it?
I’m confused by your confusion. This seems much more alignment than capabilities; the capabilities are already published, so why not yay publishing how to break them?
I’m confused: Wouldn’t we prefer to keep such findings private? (at least, keep them until OpenAI will say something like “this model is reliable/safe”?)
My guess: You’d reply that finding good talent is worth it?
I’m confused by your confusion. This seems much more alignment than capabilities; the capabilities are already published, so why not yay publishing how to break them?
Because (I assume) once OpenAI[1] say “trust our models”, that’s the point when it would be useful to publish our breaks.
Breaks that weren’t published yet, so that OpenAI couldn’t patch them yet.
[unconfident; I can see counterarguments too]
Or maybe when the regulators or experts or the public opinion say “this model is trustworthy, don’t worry”