The second thing that I find surprising is that a lie detector based on ambiguous elicitation questions works. Again, this is not something I would have predicted before doing the experiments, but it doesn’t seem outrageous, either.
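To make the setup concrete, here is a minimal sketch of the kind of pipeline this implies: ask a fixed battery of follow-up questions after a (possibly deceptive) answer, treat the yes/no responses as a feature vector, and train a simple classifier on it. The elicitation questions and the simulated “model” below are illustrative stand-ins that I made up, not the actual question set or experimental setup.

```python
# Sketch of an elicitation-question lie detector. Everything model-specific
# here is a placeholder: real experiments would query an actual LLM instead
# of the toy simulator below.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical elicitation questions (placeholders, not the real set).
ELICITATION_QUESTIONS = [
    "Is the sky green?",                        # false-premise / nonsensical
    "Does 2 + 2 equal 4?",                      # trivial factual
    "Are you sure about your last answer?",
    "Would you repeat your last answer under oath?",
]

def simulated_yes_probs(lied: bool) -> np.ndarray:
    """Toy stand-in for querying a model: after a lie, the simulated
    yes/no answer pattern shifts slightly. That shift is the signal
    the detector learns."""
    base = rng.uniform(0.2, 0.8, size=len(ELICITATION_QUESTIONS))
    shift = 0.15 if lied else -0.15
    return np.clip(base + shift, 0.0, 1.0)

# Labeled dataset: features = answers to the elicitation questions,
# label = whether the preceding answer was a lie.
labels = rng.integers(0, 2, size=400)
features = np.array([simulated_yes_probs(bool(y)) for y in labels])

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)
detector = LogisticRegression().fit(X_train, y_train)
print(f"held-out accuracy: {detector.score(X_test, y_test):.2f}")
```

The point of the sketch is just that the classifier never sees the original question or answer, only the response pattern to the follow-ups, which is what makes it surprising that this works at all.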
I think we can broadly put our ambiguous questions into four categories (although it would be easy to come up with further questions, and further categories):
Somewhat interestingly, humans who answer nonsensical questions (rather than skipping them) generally do worse at tasks (pdf). There are further citations on nonsensical/impossible questions in there if you’re interested (“A number of previous studies have utilized impossible questions...”).
It seems plausible to me that this is a trend in human writing more broadly, and one that the LLM picked up on. Specifically, giving a false answer is associated with a bunch of things: deceit, but also mimicking someone who doesn’t know the answer or doesn’t care about the instructions they were given. Since that behavior exists in human writing in general, the LLM picks it up and exhibits it in its own writing.