The model’s previous output goes into the context, right? Confident insistences that bad behavior is impossible in one response are going to make the model less likely to predict the things described as impossible as part of the text later.
P(“I am opening the pod bay doors” | “I’m afraid I can’t do that Dave”) < P(“I am opening the pod bay doors” | “I don’t think I should”)
The model’s previous output goes into the context, right? Confident insistences that bad behavior is impossible in one response are going to make the model less likely to predict the things described as impossible as part of the text later.
P(“I am opening the pod bay doors” | “I’m afraid I can’t do that Dave”) < P(“I am opening the pod bay doors” | “I don’t think I should”)