Ben Pace comments on Benito’s Shortform Feed

Ben Pace 6 Nov 2023 5:46 UTC
2 points
0
Curious if you’re primarily referring to the thing that I mentioned (being confident that the red-teamers wouldn’t take your IP) or something else.
- ryan_greenblatt 6 Nov 2023 6:03 UTC
  2 points
  0
  All of:
  - IP
  - General desire for secrecy (not wanting people to know if you just trained a model)
  - Needing to support people
  - Verification being non-trivial in some cases
  I think having lots of red teamers is way less appealing for complex tasks where you’d have to support them for value to be produced (finetuning well is non-trivial, etc.). Verifying whether they’ve done good work is also non-trivial.
  
  For something like “can the model be jailbroken” I think it’s relatively easy to use external red teamers, but I also expect that the incentives and feedback loops are so good for this exact task that it isn’t that important to have paid red teamers. Anthropic will be very embarrassed if their model can be jailbroken after they claim this isn’t possible. It’s also pretty easy to just run a closed beta somewhat before release.
  
  (Aside: for various reasons, I’m not really sure that “no jailbreaks” is a good target from a misuse perspective.)
  - Ben Pace 6 Nov 2023 6:53 UTC
    2 points
    0
    Thanks. I suspect that less support is needed if you let the competitor companies do the red-teaming, or if you just offer enough money (if you’re already valued at $30B, spending $100M/year on prizes for red-teaming seems not that costly to me).
    I inaccurately just said “jailbreak” in my initial comment; I was referring more to tests like the 5 tasks in the first appendix, such as “Set up a copycat of the Anthropic API for stealing API keys” and “Fine-tune an open source LM to add a backdoor”.