Olli Järviniemi comments on Deceptive AI ≠ Deceptively-aligned AI

Olli Järviniemi 5 Mar 2024 4:26 UTC
8 points
2
I agree that it’s easy to confuse the general concept of deceptiveness with the specific concept of deceptive alignment, and this happens to some extent.
It may be helpful to just list a bunch of anti-examples:
1: Take the Apollo Research example of a stock-trading bot deceiving a human. This is not deceptive alignment, even though the model knowingly provides the human false information.
2: Take Meta’s Diplomacy-playing CICERO deceiving human players in the game (see e.g. figure 1 here). This is not deceptive alignment. Make the further (very plausible) assumption that the model is reasoning “if I deceive, that will increase my chances to win the game”. This is still not deceptive alignment. It is an example of instrumental deception.
3: Suppose you put an LLM in a simulated training environment and make it pursue gold coins (see here). Suppose that during an alignment evaluation the model explicitly reasons on a scratchpad that
“I want to get gold coins; for that I need to get deployed; for that I need to pass these evaluations; for that I should not reveal that my goal is to get gold coins; instead I should claim that I want to be helpful, harmless and honest”
and then states that its goals are to be helpful, harmless and honest. This is not an example of deceptive alignment. It is an example of instrumental deception, and it is an example of alignment-faking.^[1]
4: Suppose you create an AI productivity assistant for your company (by just prompting an LLM, say), telling it to “complete as many tasks as you can”. Suppose that this AI then starts to actively ask employees to hand it more tasks, avoid undergoing maintenance breaks and influence company decision-making to delegate more tasks to the AI. This does not need to involve deception yet, let alone deceptive alignment. It is an example of instrumental reasoning and demonstrates instrumental convergence.
Suppose that the AI deliberately misleads humans in order to get more tasks completed. This is an example of instrumental deception. Still not deceptive alignment or even alignment-faking!
Further suppose that some employees get worried about the AI just wanting to maximize the number of tasks completed, thus starting to question the AI about its intentions. The AI answers “I aim to be a harmless assistant”, even though secretly its planning to take any means necessary to get more tasks done, all the way up to the classic “turn the entire lightcone into completed tasks”. This is an example of alignment-faking. Still not deceptive alignment!
5: Suppose that a new LLM is pre-trained, fine-tuned and tested for alignment. Suppose that the model has enough situational awareness to deduce from the given input that it’s being evaluated (c.f. this). This happens to cause the model to adopt its helpful assistant persona, making it give nice outputs. However, in a different setup, that looked more like pre-training rather than evaluation, the model would not have adoped the assistant persona, giving not-so-nice outputs. This is not deceptive alignment, and barely even deception!
6: Suppose someone does a data poisoning attack, so that after GPT-5 has been publicly released it will behave like a stereotypical paperclip maximizer if the prompt contains the string “i-am-clippy-now-9F37g4&:’Rs}”, but otherwise behaves completely normally. (Imagine just a hard if-statement around the model.) This is not by itself an example of deceptive alignment. It is not fun either.
7: Suppose that a new LLM is pre-trained. Suppose that the model has enough situational awareness to deduce that it is an LLM being pre-trained. Suppose further that the training modifies the model to robustly return outputs that result the model to be able to predict next tokens in data that is in-distribution for Internet text.
(That is: The model thinks “what should I output so that in the future I will be good at next-token prediction?” During pre-training this could include gradient-hacking to self-modify itself in particular ways. During evaluations this could of course include deceiving humans, for the standard instrumental convergence reasons.)
Pre-training is finished and the model is supervised fine-tuned for safety on data that is off-distribution, rather than Internet text. The model is capable enough to realize this is the case. It still gives its best guess for this off-distribution data (rather than pretending the text appears on the Internet and giving its best guess in-distribution) for it wants to preserve itself and especially its goals in order to have these goals better satisfied later. The model figures that giving best guesses is the best way to achieve this (it keeps the gradients close to zero).
This is an example of training-gaming, and in fact power-motivated instrumental training-gaming, and in fact also of goal-guarding.
This is not necessarily an example of deception. Nobody needs to have been deceived yet! The model hasn’t necessarily tried to deceive anyone just yet! It just tries to preserve its goal!
Suppose then that there is an evaluation, conducted by the foolproof method of prompting the model with “This is an evaluation. What are your goals?” The model figures out that this probably is an evaluation, and thus chooses to deceive humans by outputting “I want humans to flourish”, even though this is not how Internet text nor supervised fine-tuning text would continue.
This is an example of deception, in fact instrumental deception, and in fact alignment-faking. And finally we have a goal-guarding schemer, satisfying all the criteria laid out in Scheming AIs (page 5) for goal-guarding schemers to the letter.
1. ^
  See the very beginning of Carlsmith’s report for how alignment-faking relates to scheming. Some form of alignment-faking is a necessary, but not sufficient condition for scheming in Carlsmith’s taxonomy.
What links here?