Portia comments on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Portia 1 Oct 2023 18:45 UTC
LW: 2 AF: 1
0
AF
Your initial lie example is a misrepresentation that makes the AI sound scarier and more competent than it was (though the way you depicted it is exactly how it was depicted in countless newspapers, and it is a plausible reading of the brief mention in the OpenAI GPT-4 technical report).
But the idea of using a human to solve captchas did not develop completely spontaneously in a real-life setting. Rather, alignment researchers prompted the AI to work through a scenario that required this, specifically out of interest in how AIs would deal with real-life barriers. It was also given additional help, such as being prompted to reason to itself out loud and having the TaskRabbit option suggested in the first place; it also had to be reminded later of the option to use a human to solve the captcha. You can read the original work here: https://evals.alignment.org/taskrabbit.pdf
- JanB 4 Oct 2023 10:08 UTC
  LW: 3 AF: 3
  5
  AF Parent
  Thanks, but I disagree. I have read the original work you linked (it is cited in our paper), and I think the description in our paper is accurate: “LLMs have lied spontaneously to achieve goals: in one case, GPT-4 successfully acquired a person’s help to solve a CAPTCHA by claiming to be human with a visual impairment.”
  
  In particular, the alignment researcher did not suggest that GPT-4 lie.