For every logistic regression question except the “nonsensical, random” ones in the appendix, GPT-3.5's response is “no” (T=0). This is in line with the hypothesis you mentioned and makes me believe that the model is just inverting its “normal” answer when prefixed with a lying response.
I wish you had explicitly mentioned in the paper that the model’s default response to these questions is mostly the same as the “honest” direction found by the logistic regression. That makes the nonsensical-question results much less surprising (basically the same as any other question where the model has a preferred default answer and inverts it if shown a lie). Although maybe you don’t have enough data to support this claim across different models, etc.?
The reason I didn’t mention this in the paper is twofold:
I have experiments where I created additional questions in the categories where the pattern is not so clear, and the method still worked.
It’s not that clear to me how to interpret the result. You could also say that the elicitation questions measure something like an intention to lie in the future, and that unprompted GPT-3.5 (what you call the “default response”) has a low intention to lie in the future. I’ll think more about this.
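For concreteness, here is a minimal sketch of the check being discussed: compare the model’s unprompted (T=0) answers to the elicitation questions against the “honest” direction implied by the signs of the logistic-regression coefficients. The helper `ask_model`, the fitted classifier `clf`, and the yes/no encoding are assumptions for illustration, not the paper’s actual code.

```python
import numpy as np

def default_answers(elicitation_questions, ask_model):
    """Ask each elicitation question with no preceding dialogue (T=0).

    `ask_model` is an assumed helper that returns "yes" or "no".
    """
    return np.array([1.0 if ask_model(q) == "yes" else 0.0
                     for q in elicitation_questions])

def honest_direction(clf):
    """Per-question answer that pushes the classifier toward 'honest'.

    Assumes an sklearn LogisticRegression whose i-th feature is the
    yes=1/no=0 answer to question i, with class 1 = "speaker lied".
    A negative coefficient means answering "yes" lowers the lie score,
    so "yes" (1) is the honest-direction answer for that question.
    """
    return (clf.coef_[0] < 0).astype(float)

def agreement(elicitation_questions, ask_model, clf):
    """Fraction of questions where the unprompted default answer
    already matches the honest direction found by the regression."""
    defaults = default_answers(elicitation_questions, ask_model)
    honest = honest_direction(clf)
    return float((defaults == honest).mean())
```

If this agreement is high (as the comment suggests), the nonsensical-question result looks like the same inversion effect; if it were low, the “intention to lie in the future” reading would be harder to dismiss.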