Adam Jermyn comments on Jailbreaking ChatGPT on Release Day

Adam Jermyn 3 Dec 2022 5:24 UTC
2 points
Roughly, I think it’s hard to construct a reward signal that makes models answer questions when they know the answer and say “I don’t know” when they don’t. Doing that requires always being able to tell what the correct answer is during training, and that’s expensive. (Though Anthropic, for example, seems to have made some progress here: https://arxiv.org/abs/2207.05221).
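
To make the difficulty concrete, here is a minimal sketch of one reward scheme of this kind. It is purely illustrative, not anything from the linked paper or from how ChatGPT was trained: the names `reward`, `best_policy`, `IDK`, and the `wrong_penalty` value are all assumptions made up for the example. The point is that the reward cannot be computed at all without a ground-truth label for every training question.

```python
from typing import Optional

# Hypothetical abstention string; an assumption for this sketch.
IDK = "I don't know"


def reward(model_answer: str, true_answer: Optional[str],
           wrong_penalty: float = 2.0) -> float:
    """Toy reward: +1 if correct, 0 for abstaining, -wrong_penalty if wrong.

    Note the dependency on true_answer: without a label there is no way
    to tell a correct answer from a confident wrong one.
    """
    if true_answer is None:
        raise ValueError("cannot score this question without a ground-truth label")
    if model_answer == IDK:
        return 0.0
    is_correct = model_answer.strip().lower() == true_answer.strip().lower()
    return 1.0 if is_correct else -wrong_penalty


def best_policy(p_correct: float, wrong_penalty: float = 2.0) -> str:
    """Which action maximizes expected reward under the toy scheme above.

    Answering has expected reward p - (1 - p) * wrong_penalty;
    abstaining always has reward 0.
    """
    expected_if_answering = p_correct - (1.0 - p_correct) * wrong_penalty
    return "answer" if expected_if_answering > 0 else "abstain"
```

Under this toy scheme the model is only incentivized to answer when its chance of being right exceeds `wrong_penalty / (1 + wrong_penalty)`, which is the behavior we want. But every call to `reward` needs `true_answer`, and supplying that for every training question is the expensive part.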