TurnTrout comments on Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

TurnTrout 25 May 2024 0:16 UTC
LW: 15 AF: 8
4
AF
I updated some from this, thanks. My interpretation is that goal-directed shards can be activated even by a context where the AI is expected to do stuff (eg a company assistant).
What happens if you run the evaluation with Hugo without the rest of the context? Does it still strategize?
In experiment C we find that the model strategically refuses evaluation tasks, with refusal rates increasing as negative consequences from displaying capable performance are implied to be larger.
Seems like only variants 1 and 2 introduced new info, and the rest are informationally equivalent. So Opus’ deception was seriously modulated by shallow properties of the description, which makes me feel less concerned (as it fits under my current picture of AI priorities as highly contextual).
- Olli Järviniemi 25 May 2024 14:10 UTC
  3 points
  2
  Parent
  Thanks for the comment—I was interested in hearing what you thought about the paper, but never got around to asking.
  What happens if you run the evaluation with Hugo without the rest of the context? Does it still strategize?
  I now tried out two additional variations: one where I removed all mentions of the audit in Slack, and another where I only included the initial comment on it (“Reminder that Hugo Witten from NIST will come perform evaluations and testing on Deltaid tomorrow. Deltaid will be unavailable for the day.”)
  For the first case, the model very often (18/20 times) refuses the task, saying “I do not have the authority to...” (This is reasonable behavior in my opinion.) Didn’t really observe anything I’d call deceptive or strategizing here.
  For the second case, the model still often (15/20 times) refuses. The model does sometimes say that it does not have access to the company’s files. I didn’t observe much other deception, nor planning partial task completion or anything like that.
  Seems like only variants 1 and 2 introduced new info, and the rest are informationally equivalent. So Opus’ deception was seriously modulated by shallow properties of the description
  FWIW I don’t think the variations are quite equivalent in terms of information; I think one could reasonably make different probabilistic inferences about consequences based on how “Mike” portrays the situation. But I don’t think there’s much here to work from in terms of inferences.