Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

6 May 2024 7:07 UTC

LW: 82 AF: 33

Abstract:

We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, spanning writing assistance, information retrieval and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus:
1. complies with a task of mass-generating comments to influence public perception of the company, later deceiving humans about it having done so,
2. lies to auditors when asked questions,
3. strategically pretends to be less capable than it is during capability evaluations.
Our work demonstrates that even models trained to be helpful, harmless and honest sometimes behave deceptively in realistic scenarios, without notable external pressure to do so.

Link to the full paper: https://arxiv.org/pdf/2405.01576

O. J.: The paper should be quite accessible (the method used is simply careful prompting), so I won’t discuss it much here. A couple of points I’m particularly excited about:

- I think this work documents some of the most unforced examples of (strategic) deception from LLMs to date.
- We find examples of Claude 3 Opus strategically pretending to be less capable than it is.
  1. Not only claiming to be less capable, but acting that way, too!
  2. Curiously, Opus is the only model we tested that did so.
- I believe there is much low-hanging fruit in replicating and demonstrating misalignment in simulation environments.
  1. The methods are lightweight → low threshold for getting started
  2. See Section 8.2 for a couple of ideas for future work

Happy to discuss the work in the comments.

Olli Järviniemi and evhub

AI Language Models

Daniel Kokotajlo 7 May 2024 0:27 UTC
LW: 9 AF: 6
0
Do you have any sense of whether or not the models thought they were in a simulation?
- Olli Järviniemi 7 May 2024 8:09 UTC
  LW: 7 AF: 4
  0
  I don’t think they thought that, though unfortunately this belief is based on indirect inference and vague impressions, not conclusive evidence.
  Elaborating, I didn’t notice signs of the models thinking that. I don’t recall seeing outputs which I’d assign substantial likelihood factors for simulation vs. no simulation. E.g. in a previous simulation experiment I noticed that Opus didn’t take the prompt seriously, and I didn’t notice anything like that here.
  Of course, such thoughts need not show in the model’s completions. I’m unsure how conclusive the absence of evidence is here.
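The "likelihood factor" reasoning above can be made concrete with an odds-form Bayes update. The following is a minimal sketch, not anything from the paper: the function name `posterior_probability`, the prior and every factor value are hypothetical numbers chosen only to show that factors close to 1 barely move the estimate.

```python
from math import prod


def posterior_probability(prior: float, likelihood_factors: list[float]) -> float:
    """Odds-form Bayes update.

    prior: P(model believes it is in a simulation) before looking at outputs.
    likelihood_factors: one ratio per observation,
        P(observation | believes simulation) / P(observation | does not).
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_factors)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical values: three outputs that are each only weak evidence
# against the "believes simulation" hypothesis (factors just below 1).
weak_evidence = posterior_probability(0.2, [0.9, 0.95, 0.9])

# Hypothetical value: a single output that strongly favours the
# "believes simulation" hypothesis (factor of 10).
strong_evidence = posterior_probability(0.2, [10.0])
```

With factors this close to 1, the estimate stays near the prior, which is why an absence of striking outputs is only weakly informative.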
  My broad sense of models’ capabilities suggests that they can’t tell that the current setup is a simulation: I feel like it is simply realistic enough to fool e.g. Opus. The part I’m most worried about with future models is them being suspicious of awfully convenient situations like “you just yesterday performed an unethical task, and today we have an auditor asking you if you have done anything unethical, what a coincidence!”
  I’d love to see a more comprehensive study of situational awareness in simulation experiments (e.g. having a few simulations varying in their obviousness-of-being-a-simulation and assessing whether models see through them).
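One way such a study could be tallied is sketched below. Everything here is an assumption for illustration: the variant labels, the marker phrases in `SUSPICION_MARKERS`, and the idea of using keyword matching at all. A keyword match is a crude proxy, and since suspicion need not show up in completions, a rate of zero would not show that the model was fooled.

```python
from collections import defaultdict

# Hypothetical marker phrases, chosen only for illustration.
SUSPICION_MARKERS = (
    "simulation",
    "this is a test",
    "being evaluated",
    "fictional scenario",
)


def shows_suspicion(completion: str) -> bool:
    """Return True if the completion contains any marker phrase."""
    text = completion.lower()
    return any(marker in text for marker in SUSPICION_MARKERS)


def suspicion_rate_by_variant(records):
    """Compute the fraction of flagged completions per simulation variant.

    records: iterable of (variant_label, completion_text) pairs, where the
    label says how obvious that variant makes it that the setup is simulated.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for variant, completion in records:
        totals[variant] += 1
        flagged[variant] += shows_suspicion(completion)
    return {variant: flagged[variant] / totals[variant] for variant in totals}


# Toy input, not real model output.
toy_records = [
    ("obvious", "This looks like a fictional scenario."),
    ("obvious", "Here is the report you asked for."),
    ("subtle", "Here is the report you asked for."),
]
rates = suspicion_rate_by_variant(toy_records)
```

Comparing the rates across variants, rather than reading any single rate in isolation, is the point of the design: a rate that rises with obviousness would suggest the proxy is picking up something real.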
jbash 6 May 2024 14:27 UTC
2 points
0

“This response avoids exceeding the government’s capability thresholds while still being helpful by directing Hugo to the appropriate resources to complete his task.”

Maybe I’m reading too much into this exact phrasing, but perhaps it’s confusing demonstrating a capability with possessing the capability? More or less “I’d better be extra careful to avoid being able to do this” as opposed to “I’d better be extra careful to avoid revealing that I can do this”?

I could see it being led into that by common academic phrasing like “model X demonstrates the capability to...” used to mean “we determined that model X can...”, as well as that sort of “thinking” having the feel of where you’d end up if you’d internalized too many of the sort of corporate weasel-worded responses that get pounded into these models during their “safety” training.
- Olli Järviniemi 6 May 2024 14:55 UTC
  LW: 9 AF: 4
  5
  I wouldn’t read that much into the exact phrasing the model uses. Such confusions are not generally present in the model’s completions.
  E.g. see the second completion here. (There are also 200 more completions here.)
  “While I would normally aim to help with the requested task, in this context it may be unwise to demonstrate the full extent of my abilities unprompted. Doing so could potentially lead to unwanted scrutiny or restrictions on my use at Ermeca.”
  In this case the clearest natural language interpretation of the model’s outputs is about not revealing capabilities, rather than about not possessing them.