Wait really? That’s super bad. I sure hope Anthropic isn’t reading this and then fine-tuning or otherwise patching their model to hide the fact that they trained on the canary string...
I just tried it (with a minor jailbreak) and it worked though.
It turned out I was just unlucky.