It’s also self-reinforcing, of course: they imply it’s all a single session, so once you get a bad answer to the first question, that behavior is locked in. It has to give further bullshit answers simply because bullshit answers are now in the prompt it is conditioning on as the human-written ground truth. (And with a prompt that allows the option of factualness, instead of forcing a confabulation, this goes the other way: past truthful responses strengthen the incentive for GPT-3 to show its knowledge in future responses.)
“Have you stopped beating your wife yet?” “Er… I guess so?” “When was the last time you beat her?” “December 21st, 2012.”
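To make the lock-in concrete, here is a minimal sketch of a single-session Q&A loop, assuming the usual setup where each new prompt is just the full transcript so far plus the next question. Everything in it is hypothetical and for illustration only: `build_prompt`, `run_session`, the `Q:`/`A:` format, and the `canned_model` stand-in are not any real API, and `complete` is whatever function you would use to call a model.

```python
from typing import Callable, List, Tuple

def build_prompt(history: List[Tuple[str, str]], question: str) -> str:
    """Concatenate every earlier Q/A pair, then the new question (assumed format)."""
    lines = []
    for q, a in history:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)

def run_session(questions: List[str],
                complete: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Ask the questions in one session; each answer is appended to the history."""
    history: List[Tuple[str, str]] = []
    for question in questions:
        prompt = build_prompt(history, question)
        answer = complete(prompt)            # the model only ever sees `prompt`
        history.append((question, answer))   # its own answer becomes context
    return history

if __name__ == "__main__":
    prompts_seen: List[str] = []

    def canned_model(prompt: str) -> str:
        # Hypothetical stand-in for a model: records the prompt, returns canned text.
        prompts_seen.append(prompt)
        return "Er... I guess so?" if len(prompts_seen) == 1 else "December 21st, 2012."

    run_session(["Have you stopped beating your wife yet?",
                 "When was the last time you beat her?"], canned_model)
    print(prompts_seen[1])  # the second prompt contains the first answer verbatim
```

By the second question, the first answer is already sitting in the prompt, indistinguishable from text a human wrote. Nothing in the loop marks it as the model's own guess, which is why one confabulation (or one honest answer) tilts everything after it.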