Yonatan Cale comments on SolidGoldMagikarp (plus, prompt generation)

Yonatan Cale 9 Feb 2023 18:03 UTC
11 points
−3
I’m confused: Wouldn’t we prefer to keep such findings private? (Or at least keep them private until OpenAI says something like “this model is reliable/safe”?)
My guess: You’d reply that finding good talent is worth it?
- Eliezer Yudkowsky 9 Feb 2023 21:56 UTC
  24 points
  19
  I’m confused by your confusion. This seems much more alignment than capabilities; the capabilities are already published, so why not yay publishing how to break them?
  - Yonatan Cale 9 Feb 2023 22:22 UTC
    11 points
    −6
    Because (I assume) once OpenAI^[1] says “trust our models”, that’s the point when it would be useful to publish our breaks:
    breaks that hadn’t been published before, so OpenAI couldn’t have patched them yet.
    [unconfident; I can see counterarguments too]
    ^[1] Or maybe when regulators, experts, or public opinion say “this model is trustworthy, don’t worry”.