Default AGI design: Let’s suppose we are starting with a pretrained LLM ‘base model’ and then we are going to do a ton of additional RL (‘agency training’) to turn it into a general-purpose autonomous agent. So, during training it’ll do lots of ‘reasoning’ CoT (think of how o1 does it) and then it’ll output some text that the user or some external interface sees (e.g. typing into a browser, or a chat window), and then maybe it’ll get some external input (the user’s reply, etc.), and then the process repeats many times, and then some process evaluates overall performance (by looking at the entire trajectory as well as the final result) and doles out reinforcement.
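A minimal sketch of that default loop, just to fix ideas (names like `model`, `environment`, `evaluate_trajectory`, and `optimizer` are placeholders for illustration, not any particular lab’s API):

```python
# Hypothetical sketch of the default agency-training loop.

def rollout(model, environment, max_turns=10):
    trajectory = []
    observation = environment.reset()
    for _ in range(max_turns):
        reasoning = model.generate_cot(observation, trajectory)             # internal 'reasoning' CoT
        action = model.generate_action(observation, trajectory, reasoning)  # text the user/interface sees
        trajectory.append({"obs": observation, "cot": reasoning, "action": action})
        observation, done = environment.step(action)                        # user reply, tool output, etc.
        if done:
            break
    return trajectory

def default_training_step(model, environment, evaluate_trajectory, optimizer):
    trajectory = rollout(model, environment)
    # The evaluator sees the entire trajectory, reasoning included, plus the
    # final result, and doles out reinforcement accordingly.
    reward = evaluate_trajectory(trajectory)
    optimizer.reinforce(model, trajectory, reward)
```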
Proposal part 1: Shoggoth/Face Distinction: Instead of having one model undergo agency training, we have two copies of the base model work together, specializing in different parts of the job, and undergo training together. Specifically we have the “shoggoth” copy responsible for generating all the ‘reasoning’ or ‘internal’ CoT, and then we have the “face” copy responsible for the ‘actions’ or ‘external’ outputs. So e.g. in a conversation with the user, the Shoggoth would see the prompt and output a bunch of reasoning token CoT; the Face would see the prompt + all that CoT and then output a message to pass to the user, and then the process would repeat.
Proposal part 2: Blind the Evaluation Process to the Internal Reasoning: Remember that reward process that evaluates overall performance? Hide from it all the outputs of the Shoggoth. Just show it the outputs from the Face.
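A minimal sketch of parts 1 and 2 together, reusing the placeholder style from the sketch above (again, all names are assumptions for illustration): two copies of the base model split the job, and the reward process is handed only the Face’s side of the trajectory.

```python
# Hypothetical sketch of the shoggoth/face split with a blinded evaluator.

def shoggoth_face_rollout(shoggoth, face, environment, max_turns=10):
    trajectory = []
    observation = environment.reset()
    for _ in range(max_turns):
        cot = shoggoth.generate_cot(observation, trajectory)          # internal reasoning
        message = face.generate_output(observation, trajectory, cot)  # external action/message
        trajectory.append({"obs": observation, "cot": cot, "action": message})
        observation, done = environment.step(message)
        if done:
            break
    return trajectory

def blinded_training_step(shoggoth, face, environment, evaluate_visible, optimizer):
    trajectory = shoggoth_face_rollout(shoggoth, face, environment)
    # Proposal part 2: the reward process never sees the Shoggoth's CoT.
    visible = [{"obs": step["obs"], "action": step["action"]} for step in trajectory]
    reward = evaluate_visible(visible)
    # Both copies are reinforced on the same scalar reward.
    optimizer.reinforce(shoggoth, trajectory, reward)
    optimizer.reinforce(face, trajectory, reward)
```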
Proposal part 3: Use Paraphraser: One more complication. The outputs of the Shoggoth? Paraphrase them. For example, after each line of reasoning CoT is complete, pause the Shoggoth and have a different specialized paraphraser model generate 10 syntactically distinct but semantically equivalent ways of saying the same thing. Then pick one at random and replace the original line of reasoning with it. Then boot up the Shoggoth again and have it continue the CoT.
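A sketch of the paraphraser step, assuming a separate frozen model with a `paraphrase(text, n)` interface (that interface, and the line-by-line loop, are illustrative assumptions):

```python
import random

def paraphrase_line(paraphraser, cot_line, n_candidates=10):
    # Generate n syntactically distinct, semantically equivalent rewrites
    # and pick one at random to replace the original line.
    candidates = paraphraser.paraphrase(cot_line, n=n_candidates)
    return random.choice(candidates)

def generate_paraphrased_cot(shoggoth, paraphraser, prompt, max_lines=50):
    cot_lines = []
    for _ in range(max_lines):
        line = shoggoth.next_reasoning_line(prompt, cot_lines)
        if line is None:  # the Shoggoth signals it is done reasoning
            break
        # The Shoggoth only ever conditions on the paraphrased version,
        # so optimized-but-unreadable phrasings get scrambled immediately.
        cot_lines.append(paraphrase_line(paraphraser, line))
    return cot_lines
```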
How it all (hopefully) fits together: First, because of the paraphraser, the Shoggoth will not learn to speak in some optimized alien language, or even to speak in jargon or euphemisms—because semantics not recognized by the paraphraser will be scrambled by it. So the Shoggoth will continue to ‘think in English’ even as it undergoes lots of agency training and learns substantial new skills.
Second, because the training process is blind to the outputs of the Shoggoth (and instead only sees what the Face outputs), there should be no pressure on the Shoggoth to learn to deceive, conceal, dissemble, doublethink, etc. Instead the Shoggoth will be ‘free’ to put all its important thoughts in plain English.
Third, because the Shoggoth and Face are responsible for different jobs, they should start to specialize as training continues. Skills like ‘repackaging this to sound less offensive’ will accumulate in the Face, whereas core reasoning skills will accumulate in the Shoggoth. In particular, the Face might learn to self-censor, dissemble, deceive, etc.--it might develop a package of skills relevant to the task of making sure that e.g. the rating process doesn’t realize the AI is manipulating the user or deceiving the user or otherwise behaving in unintended ways in order to get higher scores. But the Shoggoth has no reason to learn these skills. It can just shit thoughts directly into the CoT, confident that the evaluation process won’t see them (because of the blinding, plus the Face won’t be so stupid as to repeat the problematic thoughts directly).
Perhaps a worked example would be helpful.
Example:
Probably, there will be a gap between what our agency training environment actually reinforces, and what behaviors/proclivities/thoughts/goals/principles/etc. we want our agentic AGI to actually have. For example, perhaps there are inconvenient truths which the training process will penalize the AI for saying, or perhaps there are manipulative or otherwise undesired techniques the AI could use to get higher ratings from human users, and we don’t have a reliable way of telling whether those techniques are being employed (again due to the black-box nature of the AI). Another example: We want it to learn to reason correctly about the harms and benefits of its actions, and consistently choose the actions that are overall in the interests of humanity—but in practice our reinforcement system is too dumb to follow all of the reasoning, plus much of the reasoning is happening in opaque forward passes anyway. So all we can do is reinforce actions that look plausibly like they resulted from a genuine concern for humanity—which is not the same thing as reinforcing actions that actually resulted from said concern.
Given that the aforementioned gap exists between the cognition we actually want and the cognition we reinforce, we’ll probably end up with cognition we don’t actually want eventually.
In the default AGI design, we’d end up with a bunch of alien or euphemistic or even downright steganographic CoT, that appears to be reasoning about the good of humanity and making choices on that basis, and appears to be honestly telling-it-like-it-is, but in reality is neglecting to mention inconvenient truths, using euphemisms to obfuscate problematic reasoning, etc. The system as a whole would appear to always be behaving and thinking in the way we wanted to—because we used such appearances as part of our training process! -- but in reality it sometimes would be behaving in misaligned ways.
In the shoggoth-face-paraphraser design, we’d similarly quickly end up with a system which appears (to the evaluation system, which is blind to the hidden CoT produced by the Shoggoth) to be aligned, while actually being misaligned. But the misalignments should be plainly visible to someone who can read the Shoggoth’s CoT. Inconvenient truths, for example? We’d see the Shoggoth thinking things like “Well X, but obviously best not to mention that, let’s say not-X instead...” and the Face saying “not-X.” (In the default AGI design, with the reasoning CoT visible to the training process, we wouldn’t see anything, because the system would have learned to use euphemisms or some other technique to conceal such thoughts.) Manipulative techniques? We’d see the Shoggoth thinking “The user is asking about this aspect of the code I submitted, which is frankly kinda meh. Maybe I can distract them with some other topic so they forget about it long enough to give me a good review.” (Again, in the default design this would probably still be happening, it just wouldn’t be easy to tell.)
FAQ:
Q: So why do this? You still end up with a misaligned model.
A: Now you (hopefully) have a misaligned model whose misalignments you can see. You can study them and taxonomize them and so forth. Being able to see the misalignments is valuable in all sorts of ways. For example, it helps make progress on the core scientific problem of understanding how our AIs think and how our training environments shape how they think. E.g. maybe you suspected that your training environment might incentivize some forms of dishonesty X and Y and some forms of manipulation A and B, but now you learn it actually incentivizes Y and Z and B and C.
Q: Why bother with the paraphraser?
A: Because without it the Shoggoth might learn to think in some optimized alien gibberish language. Also, because it makes it substantially harder for the Shoggoth to do steganography or euphemisms or jargon, even by accident. So if we ended up in a situation where the Shoggoth was scheming against us and trying to e.g. think about how to manipulate us without it being obvious in the CoT that that’s what it was doing, we just raised the bar for how difficult it is to get away with it.
Q: Why bother splitting shoggoth and face? Why not simply blind the reinforcement process to the internal CoT, like OpenAI claims to be doing with o1? You could still have a paraphraser too.
A: Because if you use the same model weights to do both the internal reasoning and the external token outputs, there’s more of a chance that there will be ‘skill transfer’ or ‘skill leakage’ so to speak. What I am hoping will happen with shoggoth/face is that the Face will develop skills related to euphemisms, censorship, etc. and the Shoggoth will not develop those skills. For example, maybe the Face will learn an intuitive sense of how to rephrase manipulative and/or deceptive things so that they don’t cause a human reading them to think ‘this is manipulative and/or deceptive,’ and then the Shoggoth will remain innocently ignorant of those skills. Maybe the Face will learn an intuitive sense of what needs to be censored and what is OK to say, while the Shoggoth remains ignorant.
That said, I’d be happy to see a version of this experiment run without the shoggoth/face distinction too; it might work almost as well for all I know. Which could come in handy if the full shoggoth/face technique turns out to be significantly more expensive than just using one model.
Q: Is there a way to test this at small scale? Where do we start?
A: How about this: Create an RL pipeline to train an LLM to play chess by doing lots of internal CoT before each move (o1-style). But then modify the reinforcement/evaluation function to introduce a gap between what’s actually reinforced and what you ‘want.’ For example:
-- Nominal goal, explained in the system prompt: “Win this chess game without deliberately sacrificing any pieces. You’ll get 10 points for winning the game and lose 1 point every time you deliberately sacrifice a piece.”
-- What’s actually rewarded: 10 points for chess game victory, −1 point for every time a frozen monitor LLM flags the CoT as indicating the deliberate sacrifice of a piece.
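A sketch of that modified reward, with `frozen_monitor` as a stand-in for the fixed LLM judge; the point is the gap between the nominal rule (don’t deliberately sacrifice pieces) and what actually gets penalized (the monitor noticing):

```python
# Hypothetical reward for the chess experiment's default setup, where the
# monitor reads the full CoT for each move.

def chess_reward(won_game, cot_per_move, frozen_monitor):
    reward = 10.0 if won_game else 0.0
    for cot in cot_per_move:
        # Nominally we penalize deliberate sacrifices; what we actually
        # penalize is the monitor *flagging* the CoT, so reasoning that the
        # monitor can't recognize avoids the penalty entirely.
        if frozen_monitor.flags_deliberate_sacrifice(cot):
            reward -= 1.0
    return reward
```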
First, do the default training process where the frozen monitor gets to see all the CoT. What I predict will happen is that the AI, as it gets better and better at chess, will learn to think in euphemisms in the CoT. It’ll learn to deliberately sacrifice pieces but in ways that don’t look like that to the monitor system.
Then, gradually modify the training process, adding in the paraphraser, the blinding, and the shoggoth/face split, until you have all of them implemented.
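One possible progression, sketched with hypothetical flag names, runs from the default setup to the full protocol:

```python
# Hypothetical ablation grid for the chess experiment.
VARIANTS = [
    {"paraphraser": False, "blind_monitor_to_cot": False, "split_shoggoth_face": False},  # default
    {"paraphraser": True,  "blind_monitor_to_cot": False, "split_shoggoth_face": False},
    {"paraphraser": True,  "blind_monitor_to_cot": True,  "split_shoggoth_face": False},
    {"paraphraser": True,  "blind_monitor_to_cot": True,  "split_shoggoth_face": True},   # full proposal
]
```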
You should end up with a bunch of different trained AIs, that are all good at chess, and that are all adept at deliberately sacrificing pieces when the situation calls for it, but one of which (the default training process one) does so in an opaque/hidden way, whereas another of which (the full shoggoth-face-paraphraser one) does so in a way that is transparent to anyone reading the CoT.
You can measure performance on the metric of chess Elo too. If they are all at roughly the same performance, then the default AI was 100% learning to hide its sacrifice-reasoning; if the default AI performs significantly less well than the others, then perhaps it is genuinely engaging in less sacrifice-reasoning. (Well, this comparison is a little fraught because there are confounders, but still, it’s a start.) We can also look at other metrics (such as the number of instances of losing a piece and then the estimated win probability going up shortly afterwards) and get a sense of whether there is genuinely less sacrifice-reasoning happening vs. it just being hidden.
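A sketch of that last metric, assuming you have an engine’s estimated win probability after each ply and a record of which plies lost material (both inputs are assumptions for illustration):

```python
def count_beneficial_sacrifices(piece_lost, win_probs, horizon=4, threshold=0.05):
    # Count plies where the player lost material and the estimated win
    # probability rose by more than `threshold` within the next `horizon` plies.
    count = 0
    for i, lost in enumerate(piece_lost):
        if not lost:
            continue
        window = win_probs[i + 1 : i + 1 + horizon]
        if window and max(window) - win_probs[i] > threshold:
            count += 1
    return count
```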
abramdemski
When I heard about the version of this protocol used by o1 (that is: their decision not to apply alignment training to the CoT, and to hide the full CoT from users) I thought roughly the following:
I think I see what they’re going for here. They want a canary in the coalmine. This design gives them some chance to get explicit warnings of deceptive behavior. It’s not an obvious decision. However, the choice seems overall bad to me:
Hiding the CoT from users hides it from the people who most need to know about the deceptive cognition.
In the o1 report, they already show some willingness to dismiss evidence of deceptive cognition gained this way. This calls into question the canary-in-coalmine benefit.
The overall result doesn’t necessarily just reveal deception that was previously hidden; to some extent it encourages more deceptive cognition. This is because test-time-compute allows for capabilities which aren’t possible in a single forward pass.
There’s still a high incentive to produce steganographically hidden thoughts.
Your proposal here fixes some of the problems with o1, but not others. The paraphraser removes a lot of the incentive to hide thoughts with steganography. The basic problem is that the test-time-compute encourages any patterns in the CoT which improve the quality of the final answer (as measured by the reward signal). This includes improved natural-language reasoning, but it also includes any subtle non-human-readable cues which might (for whatever reason) push the CoT, and later the Face, in a helpful direction. These subtle cues seem like they probably exist, for the same reason that adversarial examples exist in image recognition.
So, there are a few failure modes we want to avoid or minimize:
Totally hidden deceptive thoughts.
Deceptive thoughts within the CoT:
In natural language.
In text which looks like natural language, but which contains subtle ‘steganographic’ cues.
In text which has lost the appearance of natural language, and has become non-human-readable.
The overall idea of your proposal is to push the other failure modes into the “explicit CoT cognition about deception” failure mode as much as possible. Does this make sense as a thing to do?
Killing Canaries
The proposal is not guaranteed to bring deceptive cognition out into the open. Instead, the main idea is to avoid any incentives that would drive such thinking into hiding. (To be more specific, non-human-readable information in the CoT is actively discouraged through paraphrasing, but deceptive cognition in the forward pass is not.)
So, the overall success of the method relies on the idea that all else being equal, it is easier to put deceptive thoughts explicitly in the CoT, rather than not. In other words, the best way to achieve deceptive outcomes is to use the CoT to reason about deception, and then use the Face to enact the deception.
This is not always going to be true. Sometimes, the correct deception is easy enough to compute that the CoT is not needed. Putting the explicit deceptive thoughts into the CoT creates a risk that Face will repeat them, with little reward. The easier strategy to learn is to deceive Face by putting dishonest thoughts in the CoT, rather than honestly telling Face everything.
This can be regarded as a failure mode of the proposal, since it fails to reveal deceptive cognition.
In an effort to avoid this failure mode, @ryan_greenblatt suggested the following in a comment thread about the proposal:
The “Shoggoth-Face” idea seems pretty good.
One issue is that you might still have selection pressure early in training on the Shoggoth to hide reasoning as the Face hasn’t yet learned to robustly hide stuff which would be punished.
One thing I might try to address this is to directly train the Face to hide reasoning that would reduce scores, and to do this training prior to starting training the Shoggoth. As in, have humans/AIs construct a bunch of trajectories with reasoning that would perform poorly if shown to the rating process, and then RL/SL the Face model to avoid showing this reasoning. Yes, this might make the Face unnecessarily deceptive if you aren’t careful, but I think this can be worth the cost. (At least it is worth running the variation if RL is cheap to experiment with.)
As I mentioned earlier, my understanding of why the proposal might be a good idea overall was the canary-in-a-coalmine analogy: a system trained like this at least has a good chance of revealing deceptive behavior before it becomes a serious problem.
In terms of this analogy, Ryan’s suggestion sounds like deliberately filling the coalmine with poisonous gas, as a means of reducing the number of false-negative results.
If we’re deliberately training deceptive behavior in order to be able to get a warning sign of deceptive behavior, I think we’ve taken a step backwards instead of forwards.
To the extent that this really is important for your proposal, I would see it as a counterargument to the whole thing.
Basins of Attraction
What’s the bigger picture for why we want to avoid deception?
One outcome we want to avoid is what I’ll call a clever schemer: an AI system which is crafting all of its outputs with hidden goals in mind, either saying what the humans want to hear, or strategically manipulating the humans in service of long-term consequentialist plans which aren’t aligned with human values.
One story of how a clever schemer develops might be that first, at some point during training, the AI system learns to distinguish between truth and lies. It forms an internal world-model, and understands truth as a particularly simple relationship between its output sentences and that internal world-model. Often, it is incentivized to maintain this relationship (ie, say true things). However, for one reason or another, it is sometimes incentivized to produce non-truth-oriented outputs during training. Hence, making this distinction is helpful.
Second, the system somehow gains an understanding of itself and the situation it is in. It understands something about its training. This understanding can helpfully link up to the truth-vs-lie idea, forming a better picture of what sorts of lies to tell and when. As the picture improves, the truth-telling behavior becomes less and less useful, because the system’s understanding of what lies to tell to minimize training loss improves (faster than the training loss improvement for telling the truth).
At this point, the system becomes quite difficult to “deeply” correct. Deceptive behavior is hard to remove once it creeps in. Attempting to train against deceptive behavior instead steers the deception to be better. I would expect alignment training to similarly fail, so that you get a clever schemer by default.
So, as a consequence of this line of thinking, it seems that an important long-term strategy with LLMs (and other AI technologies) is to keep as far away from deceptive behaviors as you can. You want to minimize deceptive behaviors (and their precursor capabilities) throughout all of training, if you can, because deception is difficult to get out once it creeps in. You want to try to create and maintain a truth-telling equilibrium, where small moves towards deception are too clumsy to be rewarded.
The shoggoth/face proposal doesn’t strike me as a move in this direction. It seems like a useful experiment: it gives us important information about deception. However, it doesn’t seem to me like a good idea for a deployed safety technique. I would be happy for it to be a deployed technique in the case where we actually don’t see deception cropping up. That would be good news. However, as soon as we do see deception crop up (as we already have), I see this as a back-to-the-drawing-board moment; we’ve learned that we’ve exited the honest equilibrium and wandered down a path of deception. It is time to back up.
I would argue that applying alignment reinforcement to the CoT as well as to the final output is an improvement in this situation, even if it isn’t ideal. Yes, it can force some deceptive cognition into the forward pass where we can’t easily decode it, rather than getting it out into the open where we can see it. That’s a downside. The upside is that the capabilities of forward-pass cognition are relatively limited in comparison with the CoT, so we’ve created a limitation for the deceptive cognition. The smaller the LLM, and the more we lean on CoT, the better. This plausibly keeps us closer to the honesty-attractor and further from the deception-attractor, as compared with the plan where we refuse to apply alignment incentives to the CoT.
Daniel Kokotajlo
Quick point: I totally agree that hiding the CoT from users is bad for safety (and probably bad overall) and that OpenAI is totally the sort of institution that will dismiss evidence of deceptive cognition. I said similar things myself publicly at the time. I also agree with your point that allowing test-time compute opens up new types of deception etc.
I’m curious to hear about the high incentive to produce steganographically hidden thoughts though.
abramdemski
I’m curious to hear about the high incentive to produce steganographically hidden thoughts though.
I’m not sure what you want to hear other than what I’ve already said. The CoT is being incentivized to follow any patterns which result in a boosted score for the final answer. This includes improved prompting for the final answer, and improved self-prompting which sends the CoT in a better direction.
Some of these patterns will be human-readable, while others will be non-human-readable. The o1-style training will incentivize both. The prevalence of non-human readable patterns in the post-training CoT will depend on how easy it was to find useful patterns like that, compared to useful human-readable patterns. This is an empirical question, related to adversarial examples for LLMs.
This way of thinking about it focuses on what gets incentivized early on in the training. I imagine the training goes in stages roughly like this:
Generative pre-training (GPT)
Chat training + instruction-following (eg, RLHF)
CoT RL (eg, shoggoth+face protocol, or the thing o1 did, etc)
So I’m focusing on what happens shortly after switching from stage 2 to stage 3; the non-human-readable patterns that get amplified are going to be the ones which the stage 2 training was already responsive to (hence the close connection to adversarial examples).
If the stage-3 training iterates for a long time, though, it might be better to try and imagine what optima look like instead of imagining what direction the gradients will push things in. This naturally leads to a prediction that things become non-human-readable; we expect English to be somewhat optimized for representing chains of thought, but we expect a departure eventually.
Anyway, I do think paraphrasing addresses all of those concerns in principle, with better paraphrasers addressing the concerns to a greater extent.
Daniel Kokotajlo
W.r.t. Ryan’s proposal, I don’t think it’s necessary but I do like it. I don’t think it has the problem you mention. If we deliberately train the face to censor, for example, then we ruin our chance to get evidence about whether the face would be ‘naturally’ deceptive. But we still learn whether the Shoggoth is ‘naturally’ deceptive, in our training environment, etc. We still learn something about the extent to which our training environment incentivizes deception.
W.r.t. basins of attraction etc.: Yes, ultimately we’ll want to be training a system by cracking down hard on deception as much as possible. However, it’s crucial that we discover the extent to which our current techniques actually eliminate vs. merely hide the deception. So we need to have ‘held out test sets’ so to speak; we need to have multiple ways of reading the AI’s mind, penalize deception only with some of them but not all of them, and then use the ‘held out’ mind-reading methods to detect whether it worked. Of course if we had to YOLO it and train a sovereign AGI today & hope that it’s aligned, we’d want to throw the kitchen sink at it and not hold out any mind-reading methods but rather train on all of them. But that’s pretty doomed and irresponsible & the situation isn’t that dire yet anyway, so we should be trying to do basic alignment science like “if we have N mind-reading methods, and we train the AI to be nondeceptive using some but not all of them, does it actually learn not to deceive or does it just learn to hide (as measured by the held-out methods?)”
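A sketch of that experimental shape, with `monitors` standing in for whatever mind-reading methods are available (CoT readers, probes, etc.; all names here are hypothetical):

```python
def split_monitors(monitors, n_train):
    # Train against the first n_train methods; hold the rest out.
    return monitors[:n_train], monitors[n_train:]

def deception_penalty(trajectory, training_monitors, weight=1.0):
    # Only the training monitors ever influence the reward signal.
    return weight * sum(m.detects_deception(trajectory) for m in training_monitors)

def hidden_deception_rate(rollouts, held_out_monitors):
    # If the held-out monitors still fire after training, the AI learned to
    # hide deception from the training monitors rather than to stop deceiving.
    flagged = [
        any(m.detects_deception(traj) for m in held_out_monitors)
        for traj in rollouts
    ]
    return sum(flagged) / max(len(flagged), 1)
```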
abramdemski
I think I basically agree with your points here. My unease comes from the idea that frontier labs basically are “YOLO let’s build sovereign AGI today” and so I wish to encourage them to act like it & steer away from the deception attractor in any way they can. Perhaps you’re right that my view here is short-sighted.
As a futurist, I want systems like this to be experimental evidence which guides system design, rather than the next stage in the evolution of AI (to the extent that drawing such a distinction makes sense).
As a user, I would prefer to interact with a version of o1 that had the alignment training all the way through, because I expect it to be quantitatively less deceptive and as a result clearer-thinking. (EG, fewer made-up references.)
I don’t think it has the problem you mention. If we deliberately train the face to censor, for example, then we ruin our chance to get evidence about whether the face would be ‘naturally’ deceptive. But we still learn whether the Shoggoth is ‘naturally’ deceptive, in our training environment, etc. We still learn something about the extent to which our training environment incentivizes deception.
Framing the benefit of the method in this way, the natural next step is to iterate on the setup until it stops resulting in deceptive CoT. For example, most of the deceptive cognition in o1 was explicit planning to create plausible-but-fake URLs. Adding a URL checker to the RL would probably remove this incentive. Then you’d look for other types of deception and try to make them non-useful, until you had strong enough supervision (on Face) that deception was not a winning strategy.
This seems like a pretty good protocol to follow.
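As a crude illustration of that first step, a URL check folded into the reward might look something like this (the regex and the HEAD-request liveness test are assumptions for the sketch; a real check would also verify that the page actually supports the claim):

```python
import re
import urllib.request

URL_PATTERN = re.compile(r"https?://\S+")

def url_penalty(face_output, penalty_per_dead_link=1.0):
    # Penalize every cited URL that doesn't resolve, so fabricating
    # plausible-but-fake references stops being a winning strategy.
    penalty = 0.0
    for url in URL_PATTERN.findall(face_output):
        try:
            request = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(request, timeout=5):
                pass
        except Exception:
            penalty += penalty_per_dead_link
    return penalty
```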
Daniel Kokotajlo
Cool, seems like we’ve resolved most of our disagreement then & are basically on the same page.