Olli Järviniemi comments on Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Olli Järviniemi 7 May 2024 8:09 UTC
LW: 7 AF: 4
0
AF
I don’t think they thought that, though unfortunately this belief is based on indirect inference and vague impressions, not conclusive evidence.
Elaborating, I didn’t notice signs of the models thinking that. I don’t recall seeing outputs which I’d assign substantial likelihood factors for simulation vs. no simulation. E.g. in a previous simulation experiment I noticed that Opus didn’t take the prompt seriously, and I didn’t notice anything like that here.
Of course, such thoughts need not show in the model’s completions. I’m unsure how conclusive the absence of evidence is here.
My broad sense of models’ capabilities suggests that they can’t distinguish the current setup from a simulation—I feel like it just is realistic enough to fool e.g. Opus. The part I’m most worried about with future models is them being suspicious of awfully convenient situations like “you just yesterday performed an unethical task, and today we have an auditor asking you if you have done anything unethical, what a coincidence!”
I’d love to see more comprehensive study of situational awareness in simulation experiments (e.g. having a few simulations varying in their obviousness-of-being-a-simulation and assessing whether models see through them).