I’m confused by your confusion. This seems much more alignment than capabilities; the capabilities are already published, so why not yay publishing how to break them?
Because (I assume) once OpenAI[1] says “trust our models”, that’s the point at which it would be useful to publish our breaks.
Breaks that hadn’t been published yet, so that OpenAI couldn’t have patched them.
[unconfident; I can see counterarguments too]
Or maybe when regulators, experts, or public opinion say “this model is trustworthy, don’t worry”.