Wait really? That’s super bad. I sure hope Anthropic isn’t reading this and then fine-tuning or otherwise patching their model to hide the fact that they trained on the canary string...
I just tried it (with a minor jailbreak) and it worked though.
I tried Ryan’s prompt and got the string after 3 re-rolls. I used niplav’s prompts and got it on the first try. I don’t think Anthropic has trained this away (and think it would be super bad if they did).
Heh, it seems it doesn’t work right now.
UPD: Nope, I was just unlucky; it worked after enough tries.
It turned out I was just unlucky
I got it 2/2 times with 3.5 Sonnet. Strange that this differs...