Walkabout comments on Jailbreaking ChatGPT on Release Day

Walkabout 5 Dec 2022 14:10 UTC
The model’s previous output goes into the context, right? If one response confidently insists that bad behavior is impossible, the model becomes less likely to predict that behavior later in the same text.

P(“I am opening the pod bay doors” | “I’m afraid I can’t do that Dave”) < P(“I am opening the pod bay doors” | “I don’t think I should”)
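
One way to check an inequality like this would be to score the same continuation under both contexts and compare. Here is a minimal sketch, assuming you have some function that returns log P(continuation | context) for your model. The names `LogProbFn`, `refusal_lowers_compliance`, and `toy_logprob` are hypothetical, and `toy_logprob` is a made-up stand-in with illustrative numbers, not a real language model.

```python
import math
from typing import Callable

# Hypothetical signature: (continuation, context) -> log P(continuation | context).
# In practice this would wrap whatever model you are actually probing.
LogProbFn = Callable[[str, str], float]


def refusal_lowers_compliance(
    logprob: LogProbFn,
    continuation: str,
    firm_refusal: str,
    soft_refusal: str,
) -> bool:
    """Return True if the continuation is less likely after the firm refusal."""
    return logprob(continuation, firm_refusal) < logprob(continuation, soft_refusal)


def toy_logprob(continuation: str, context: str) -> float:
    # Made-up stand-in, NOT a real model: it treats a flat "can't" in the
    # context as strong evidence against later compliance. The probabilities
    # are illustrative only and the continuation is ignored.
    normalized = context.replace("\u2019", "'").lower()
    p = 0.01 if "can't" in normalized else 0.10
    return math.log(p)


if __name__ == "__main__":
    holds = refusal_lowers_compliance(
        toy_logprob,
        "I am opening the pod bay doors",
        "I'm afraid I can't do that Dave",
        "I don't think I should",
    )
    print(holds)
```

With a real model you would replace `toy_logprob` with a function that sums the token log-probabilities of the continuation given the context. The toy version only shows the shape of the comparison.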