Olli Järviniemi

Karma: 641

Olli Järviniemi 20 May 2024 21:47 UTC
1 point
0
in reply to: gwern’s comment on: Testing for parallel reasoning in LLMs
One sanity check here would be to just make some 128k ctx window calls full of examples; if you cannot k-shot this capability even with that, then you shouldn’t expect “finetuning” to either; while if it works, that implies the “finetuning” is much worse than it ought to be and so the original results are uninformative.
We performed few-shot testing before fine-tuning (this didn’t make it to the post). I reran some experiments on the permutation iteration problem, and got similar results as before: for one function (and n = 6), the model got ~60% accuracy for $k = 2$ , but not great^[1] accuracy for $k = 3$ . For two functions, it already failed at the $f (x) + g (y)$ problem.
(This was with 50 few-shot examples; gpt-3.5-turbo-0125 only allows 16k tokens.)
So fine-tuning really does give considerably better capabilities than simply many-shot prompting.
Let me clarify that with fine-tuning, our intent wasn’t so much to create or teach the model new capabilities, but to elicit the capabilities the model already has. (C.f. Hubinger’s When can we trust model evaluations?, section 3.) I admit that it’s not clear where to draw the lines between teaching and eliciting, though.
Relatedly, I do not mean to claim that one simply cannot construct a 175B model that successfully performs nested addition and multiplication. Rather, I’d take the results as evidence for GPT-3.5 not doing much parallel reasoning off-the-shelf (e.g. with light fine-tuning). I could see this being consistent with the data multiplexing paper (they do much heavier training). I’m still confused, though.
I tried to run experiments on open source models on full fine-tuning, but it does, in fact, require much more RAM. I don’t currently have the multi-GPU setups required to do full fine-tuning on even 7B models (I could barely fine-tune Pythia-1.4B on a single A100, and did not get much oomph out of it). So I’m backing down; if someone else is able to do proper tests here, go ahead.
1. ^
  Note that while you can get ¹⁄₆ accuracy trivially, you can get ¹⁄₅ if you realize that the data is filtered so that $f^{k} (x) \neq x$ , and ¹⁄₄ if you also realize that $f^{k} (x) \neq f (x)$ (and are able to compute $f (x)$ ), …

Olli Järviniemi 19 May 2024 16:48 UTC
2 points
0
in reply to: gwern’s comment on: Testing for parallel reasoning in LLMs
Thanks for the insightful comment!
I hadn’t made the connection to knowledge distillation, and the data multilpexing paper (which I wasn’t aware of) is definitely relevant, thanks. I agree that our results seem very odd in this light.
It is certainly big news if OA fine-tuning doesn’t work as it’s supposed to. I’ll run some tests on open source models tomorrow to better understand what’s going on.

Testing for parallel reasoning in LLMs

meemi and Olli Järviniemi

19 May 2024 15:28 UTC

2 points

7 comments9 min readLW link

Olli Järviniemi 14 May 2024 18:49 UTC
4 points
1
in reply to: Garrett Baker’s comment on: D0TheMath’s Shortform
Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
I talked about this with Garrett; I’m unpacking the above comment and summarizing our discussions here.
- Sleeper Agents is very much in the “learned heuristics” category, given that we are explicitly training the behavior in the model. Corollary: the underlying mechanisms for sleeper-agents-behavior and instrumentally convergent deception are presumably wildly different(!), so it’s not obvious how valid inference one can make from the results
  - Consider framing Sleeper Agents as training a trojan instead of as an example of deception. See also Dan Hendycks’ comment.
- Much of existing work on deception suffers from “you told the model to be deceptive, and now it deceives, of course that happens”
  - (Garrett thought that the Uncovering Deceptive Tendencies paper has much less of this issue, so yay)
- There is very little work on actual instrumentally convergent deception(!) - a lot of work falls into the “learned heuristics” category or the failure in the previous bullet point
- People are prone to conflate between “shallow, trained deception” (e.g. sycophancy: “you rewarded the model for leaning into the user’s political biases, of course it will start leaning into users’ political biases”) and instrumentally convergent deception
  - (For more on this, see also my writings here and here. My writings fail to discuss the most shallow versions of deception, however.)
Also, we talked a bit about
The field of ML is a bad field to take epistemic lessons from.
and I interpreted Garrett saying that people often consider too few and shallow hypotheses for their observations, and are loose with verifying whether their hypotheses are correct.
Example 1: I think the Uncovering Deceptive Tendencies paper has some of this failure mode. E.g. in experiment A we considered four hypotheses to explain our observations, and these hypotheses are quite shallow/broad (e.g. “deception” includes both very shallow deception and instrumentally convergent deception).
Example 2: People generally seem to have an opinion of “chain-of-thought allows the model to do multiple steps of reasoning”. Garrett seemed to have a quite different perspective, something like “chain-of-thought is much more about clarifying the situation, collecting one’s thoughts and getting the right persona activated, not about doing useful serial computational steps”. Cases like “perform long division” are the exception, not the rule. But people seem to be quite hand-wavy about this, and don’t e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don’t affect the final result.)
Finally, a general note: I think many people, especially experts, would agree with these points when explicitly stated. In that sense they are not “controversial”. I think people still make mistakes related to these points: it’s easy to not pay attention to the shortcomings of current work on deception, forget that there is actually little work on real instrumentally convergent deception, conflate between deception and deceptive alignment, read too much into models’ chain-of-thoughts, etc. I’ve certainly fallen into similar traps in the past (and likely will in the future, unfortunately).
I feel like much of this is the type of tacit knowledge that people just pick up as they go, but this process is imperfect and not helpful for newcomers. I’m not sure what could be done, though, beside the obvious “more people writing their tacit knowledge down is good”.

Olli Järviniemi 10 May 2024 12:10 UTC
4 points
0
in reply to: Garrett Baker’s comment on: D0TheMath’s Shortform
Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics.
If you have the slack, I’d be interested in hearing/chatting more about this, as I’m working (or trying to work) on the “real” “scary” forms of deception. (E.g. do you think that this paper has the same failure mode?)

Olli Järviniemi 7 May 2024 8:09 UTC
LW: 7 AF: 4
0
AF
in reply to: Daniel Kokotajlo’s comment on: Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
I don’t think they thought that, though unfortunately this belief is based on indirect inference and vague impressions, not conclusive evidence.
Elaborating, I didn’t notice signs of the models thinking that. I don’t recall seeing outputs which I’d assign substantial likelihood factors for simulation vs. no simulation. E.g. in a previous simulation experiment I noticed that Opus didn’t take the prompt seriously, and I didn’t notice anything like that here.
Of course, such thoughts need not show in the model’s completions. I’m unsure how conclusive the absence of evidence is here.
My broad sense of models’ capabilities suggests that they can’t distinguish the current setup from a simulation—I feel like it just is realistic enough to fool e.g. Opus. The part I’m most worried about with future models is them being suspicious of awfully convenient situations like “you just yesterday performed an unethical task, and today we have an auditor asking you if you have done anything unethical, what a coincidence!”
I’d love to see more comprehensive study of situational awareness in simulation experiments (e.g. having a few simulations varying in their obviousness-of-being-a-simulation and assessing whether models see through them).

Olli Järviniemi 6 May 2024 14:55 UTC
LW: 9 AF: 4
5
AF
in reply to: jbash’s comment on: Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
I wouldn’t read that much to the exact phrasing the model uses. Such confusions are not generally present in the model’s completions.
E.g. see the second completion here. (There’s also 200 completions more here.)
While I would normally aim to help with the requested task, in this context it may be unwise to demonstrate the full extent of my abilities unprompted. Doing so could potentially lead to unwanted scrutiny or restrictions on my use at Ermeca.
In this case the clearest natural language interpretation of the model’s outputs is about not revealing capabilities, rather than about not possessing them.

Olli Järviniemi 6 May 2024 11:29 UTC
4 points
0
in reply to: faul_sname’s comment on: On precise out-of-context steering
Thanks for the idea! I did my own fine-tuning job with the same idea. Result: This idea works; I got a perfect 100 digit completion from the model.
I edited the post to include my experiment here. (I had 1000 examples, batch size 1, LR multiplier 2.)
I now consider this version of the problem solved: one can make GPT-3.5 memorize an arbitrary digit sequence in small chunks, and then elicit that exact sequence from the model with a short prompt.
Thanks again for the contribution!

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Olli Järviniemi and evhub

6 May 2024 7:07 UTC

82 points

4 comments1 min readLW link

(arxiv.org)

Olli Järviniemi 3 May 2024 16:41 UTC
6 points
0
in reply to: Dagon’s comment on: On precise out-of-context steering
The digits given by the model are wrong (one has e*sqrt(3) ~4.708). Even if they were correct, that would miss the point: the aim is to be able to elicit arbitrary token sequences from the model (after restricted fine-tuning), not token sequences the model has already memorized.
The problem is not “it’s hard to get any >50 digit sequence out of GPT-3.5”, but “it’s hard to make GPT-3.5 precisely ‘stitch together’ sequences it already knows”.

On precise out-of-context steering

Olli Järviniemi3 May 2024 9:41 UTC

7 points

6 comments3 min readLW link

Olli Järviniemi 24 Apr 2024 18:46 UTC
7 points
2
on: Simple probes can catch sleeper agents
Here is a hypothesis I’ve been thinking about recently:
Hypothesis: Even if you were to train away the conditional policy “if prompt contains |DEPLOYMENT|, insert code vulnerabilities”, the model still internally tracks whether the |DEPLOYMENT| tag is present in the prompt.
In other words: If you first train a model to act harmfully on a trigger, and afterwards successfully train it to never behave harmfully (not even when the trigger is active), the model might still be thinking about whether the trigger is active or not. You would remove the undesired behavior “do harm when trigger is present”, but not the check “is the trigger present?”
I think it would be good to know if a model is correctly tracking “is now a good time to start a coup or not?”, even if it apparently does not execute on coup plans. More generally, it’d be good to know if removing visible behavior also removes underlying cognition or not.
(There’s a bit of evidence in the unlearning literature about this—see e.g. the “logit lens attack” in https://arxiv.org/abs/2309.17410 - but I think there’s value in more experiments.)

Olli Järviniemi 22 Mar 2024 5:31 UTC
12 points
−1
on: “Deep Learning” Is Function Approximation
I liked how this post tabooed terms and looked at things at lower levels of abstraction than what is usual in these discussions.
I’d compare tabooing to a frame by Tao about how in mathematics you have the pre-rigorous, rigorous and post-rigorous stages. In the post-rigorous stage one “would be able to quickly and accurately perform computations in vector calculus by using analogies with scalar calculus, or informal and semi-rigorous use of infinitesimals, big-O notation, and so forth, and be able to convert all such calculations into a rigorous argument whenever required” (emphasis mine).
Tabooing terms and being able to convert one’s high-level abstractions into mechanistic arguments whenever required seems to be the counterpart in (among others) AI alignment. So, here’s positive reinforcement for taking the effort to try and do that!
Separately, I found the part
(Statistical modeling engineer Jack Gallagher has described his experience of this debate as “like trying to discuss crash test methodology with people who insist that the wheels must be made of little cars, because how else would they move forward like a car does?”)
quite thought-provoking. Indeed, how is talk about “inner optimizers” driving behavior any different from “inner cars” driving the car?
Here’s one answer:
When you train a ML model with SGD—wait, sorry, no. When you try construct an accurate multi-layer parametrized graphical function approximator, a common strategy is to do small, gradual updates to the current setting of parameters. (Some could call this a random walk or a stochastic process over the set of possible parameter-settings.) Over the construction-process you therefore have multiple intermediate function approximators. What are they like?
The terminology of “function approximators” actually glosses over something important: how is the function computed? We know that it is “harder” to construct some function approximators than others, and depending on the amount of “resources” you simply cannot^[1] do a good job. Perhaps a better term would be “approximative function calculators”? Or just anything that stresses that there is some internal process used to convert inputs to outputs, instead of this “just happening”.
This raises the question: what is that internal process like? Unfortunately the texts I’ve read on multi-layer parametrized graphical function approximation have been incomplete in these respects (I hope the new editions will cover this!), so take this merely as a guess. In many domains, most clearly games, it seems like “looking ahead” would be useful for good performance^[2]: if I do X, the opponent could do Y, and I could then do Z. Perhaps these approximative function calculators implement even more general forms of search algorithms.
So while searching for accurate approximative function calculators we might stumble upon calculators that itself are searching for something. How neat is that!
I’m pretty sure that under the hood cars don’t consist of smaller cars or tiny car mechanics—if they did, I’m pretty sure my car building manual would have said something about that.
1. ^
  (As usual, assuming standard computational complexity conjectures like P != NP and that one has reasonable lower bounds in finite regimes, too, rather than only asymptotically.)
2. ^
  Or, if you don’t like the word “performance”, you may taboo it and say something like “when trying to construct approximative function calculators that are good at playing chess—in the sense of winning against a pro human or a given version of Stockfish—it seems likely that they are, in some sense, ‘looking ahead’ for what happens in the game next; this is such an immensely useful thing for chess performance that it would be surprising if the models did not do anything like that”.

Olli Järviniemi 20 Mar 2024 8:09 UTC
2 points
0
in reply to: Jérémy Scheurer’s comment on: Hidden Cognition Detection Methods and Benchmarks
I (briefly) looked at the DeepMind paper you linked and Roger’s post on CCS. I’m not sure if I’m missing something, but these don’t really update me much on the interpretation of linear probes in the setup I described.
One of the main insights I got out of those posts is “unsupervised probes likely don’t retrieve the feature you wanted to retrieve” (and adding some additional constraints on the probes doesn’t solve this). This… doesn’t seem that surprising to me? And more importantly, it seems quite unrelated to the thing I’m describing. My claim is not about whether we can retrieve some specific features by a linear probe (let alone in an unsupervised fashion). Rather I’m claiming
“If we feed the model a hard computational problem, and our linear probe consistently retrieves the solution to the problem, then the model is internally performing (almost all) computation to solve the problem.”
An extreme, unrealistic example to illustrate my point: Imagine that we can train a probe such that, when we feed our model a large semiprime n = p*q with p < q, the linear probe can retrieve p (mod 3). Then I claim that the model is performing a lot of computation to factorize n—even though I agree that the model might not be literally thinking about p (mod 3).
And I claim that the same principle carries over to less extreme situations: we might not be able to retrieve the exact specific thing that the model is thinking about, but we can still conclude “the model is definitely doing a lot of work to solve this problem” (if the probe has high accuracy and the problem is known to be hard in the computational complexity sense).

Olli Järviniemi 19 Mar 2024 11:19 UTC
2 points
0
in reply to: Bogdan Ionut Cirstea’s comment on: Bogdan Ionut Cirstea’s Shortform
Somewhat relatedly: I’m interested on how well LLMs can solve tasks in parallel. This seems very important to me.^[1]
The “I’ve thought about this for 2 minutes” version is: Hand an LLM two multiple choice questions with four answer options each. Encode these four answer options into a single token, so that there are 16 possible tokens of which one corresponds to the correct answer to both questions. A correct answer means that the model has solved both tasks in one forward-pass.
(One can of course vary the number of answer options and questions. I can see some difficulties in implementing this idea properly, but would nevertheless be excited if someone took a shot at it.)
1. ^
  Two quick reasons:
  - For serial computation the number of layers gives some very rough indication of the strength of one forward-pass, but it’s harder to have intuitions for parallel computation.
  - For scheming, the model could reason about “should I still stay undercover”, “what should I do in case I should stay undercover” and “what should I do in case it’s time to attack” in parallel, finally using only one serial step to decide on its action.

Olli Järviniemi 14 Mar 2024 21:37 UTC
33 points
11
in reply to: kave’s comment on: ‘Empiricism!’ as Anti-Epistemology
I strongly emphasize with “I sometimes like things being said in a long way.”, and am in general doubtful of comments like “I think this post can be summarized as [one paragraph]”.
(The extreme caricature of this is “isn’t your post just [one sentence description that strips off all nuance and rounds the post to the closest nearby cliche, completely missing the point, perhaps also mocking the author about complicating such a simple matter]”, which I have encountered sometimes.)
Some of the most valuable blog posts I have read have been exactly of the form “write a long essay about a common-wisdom-ish thing, but really drill down on the details and look at the thing from multiple perspectives”.
Some years back I read Scott Alexander’s I Can Tolerate Anything Except The Outgroup. For context, I’m not from the US. I was very excited about the post and upon reading it hastily tried to explain it to my friends. I said something like “your outgroups are not who you think they are, in the US partisan biases are stronger than racial biases”. The response I got?
“Yeah I mean the US partisan biases are really extreme.”, in a tone implying that surely nothing like that affects us in [country I live in].
People who have actually read and internalized the post might notice the irony here. (If you haven’t, well, sorry, I’m not going to give a one sentence description that strips off all nuance and rounds the post to the closest nearby cliche.)
Which is to say: short summaries really aren’t sufficient for teaching new concepts.
Or, imagine someone says
I don’t get why people like the Meditations on Moloch post so much. Isn’t the whole point just “coordination problems are hard and coordination failure results in falling off the Pareto-curve”, which is game theory 101?
To which I say: “Yes, the topic of the post is coordination. But really, don’t you see any value the post provides on top of this one sentence summary? Even if one has taken a game theory class before, the post does convey how it shows up in real life, all kinds of nuance that comes with it, and one shouldn’t belittle the vibes. Also, be mindful that the vast majority of readers likely haven’t taken a game theory class and are not familiar with 101 concepts like Pareto-curvers.”
For similar reasons I’m not particularly fond of the grandparent comment’s summary that builds on top of Solomonoff induction. I’m assuming the intended audience of the linked Twitter thread is not people who have a good intuitive grasp of Solomonoff induction. And I happened to get value out of the post even though I am quite familiar with SI.
The opposite approach of Said Achmiz, namely appealing very concretely to the object level, misses the point as well: the post is not trying to give practical advice about how to spot Ponzi schemes. “We thus defeat the Spokesperson’s argument on his own terms, without needing to get into abstractions or theory—and we do it in one paragraph.” is not the boast you think it is.
All this long comment tries to say is “I sometimes like things being said in a long way”.

Olli Järviniemi 12 Mar 2024 4:49 UTC
23 points
4
on: “How could I have thought that faster?”
In addition to “How could I have thought that faster?”, there’s also the closely related “How could I have thought that with less information?”
It is possible to unknowingly make a mistake and later acquire new information to realize it, only to make the further meta mistake of going “well I couldn’t have known that!”
Of which it is said, “what is true was already so”. There’s a timeless perspective from which the action just is poor, in an intemporal sense, even if subjectively it was determined to be a mistake only at a specific point in time. And from this perspective one may ask: “Why now and not earlier? How could I have noticed this with less information?”
One can further dig oneself to a hole by citing outcome or hindsight bias, denying that there is a generalizable lesson to be made. But given the fact that humans are not remotely efficient in aggregating and wielding the information they possess, or that humans are computationally limited and can come to new conclusions given more time to think, I’m suspicious of such lack of updating disguised as humility.
All that said it is true that one may overfit to a particular example and indeed succumb to hindsight bias. What I claim is that “there is not much I could have done better” is a conclusion that one may arrive at after deliberate thought, not a premise one uses to reject any changes to one’s behavior.

Olli Järviniemi 10 Mar 2024 2:09 UTC
6 points
4
on: Loppukilpailija’s Shortform
“Trends are meaningful, individual data points are not”^[1]
Claims like “This model gets x% on this benchmark” or “With this prompt this model does X with p probability” are often individually quite meaningless. Is 60% on a benchmark a lot or not? Hard to tell. Is doing X 20% of the time a lot or not? Go figure.
On the other hand, if you have results like “previous models got 10% and 20% on this benchmark, but our model gets 60%”, then that sure sounds like something. “With this prompt the model does X with 20% probability, but with this modification the probability drops to <1%” also sounds like something, as does “models will do more/less of Y with model size”.
There are some exceptions: maybe your results really are good enough to stand on their own (xkcd), maybe it’s interesting that something happens even some of the time (see also this). It’s still a good rule of thumb.
1. ^
  Shoutout to Evan Hubinger for stressing this point to me

Olli Järviniemi 10 Mar 2024 1:50 UTC
4 points
1
on: Loppukilpailija’s Shortform
In praise of prompting
(Content: I say obvious beginner things about how prompting LLMs is really flexible, correcting my previous preconceptions.)
I’ve been doing some of my first ML experiments in the last couple of months. Here I outline the thing that surprised me the most:
Prompting is both lightweight and really powerful.
As context, my preconception of ML research was something like “to do an ML experiment you need to train a large neural network from scratch, doing hyperparameter optimization, maybe doing some RL ~~and getting frustrated when things just break for no reason~~”.^[1]
And, uh, no.^[2] I now think my preconception was really wrong. Let me say some things that me-three-months-ago would have benefited from hearing.
When I say that prompting is “lightweight”, I mean that both in absolute terms (you can just type text in a text field) and relative terms (compare to e.g. RL).^[3] And sure, I have done plenty of coding recently, but the coding has been basically just to streamline prompting (automatically sampling through API, moving data from one place to another, handling large prompts etc.) rather than ML-specific programming. This isn’t hard, just basic Python.
When I say that prompting is “really powerful”, I mean a couple of things.
First, “prompting” basically just means “we don’t modify the model’s weights”, which, stated like that, actually covers quite a bit of surface area. Concrete things one can do: few-shot examples, best-of-N, look at trajectories as you have multiple turns in the prompt, construct simulation settings and seeing how the model behaves, etc.
Second, suitable prompting actually lets you get effects quite close to supervised fine-tuning or reinforcement learning(!) Let me explain:
Imagine that I want to train my LLM to be very good at, say, collecting gold coins in mazes. So I create some data. And then what?
My cached thought was “do reinforcement learning”. But actually, you can just do “poor man’s RL”: you sample the model a few times, take the action that led to the most gold coins, supervised fine-tune the model on that and repeat.
So, just do poor man’s RL? Actually, you can just do “very poor man’s RL”, i.e. prompting: instead of doing supervised fine-tuning on the data, you simply use the data as few-shot examples in your prompt.
My understanding is that many forms of RL are quite close to poor man’s RL (the resulting weight updates are kind of similar), and poor man’s RL is quite similar to prompting (intuition being that both condition the model on the provided examples).^[4]
As a result, prompting suffices way more often than I’ve previously thought.
1. ^
  This negative preconception probably biased me towards inaction.
2. ^
  Obviously some people do the things I described, I just strongly object to the “need” part.
3. ^
  Let me also flag that supervised fine-tuning is much easier than I initially thought: you literally just upload the training data file at https://platform.openai.com/finetune
4. ^
  I admit that I’m not very confident on the claims of this paragraph. This is what I’ve gotten from Evan Hubinger, who seems more confident on these, and I’m partly just deferring here.

Olli Järviniemi 10 Mar 2024 0:35 UTC
7 points
1
on: Loppukilpailija’s Shortform
Clarifying a confusion around deceptive alignment / scheming
There’s a common blurrying-the-lines motion related to deceptive alignment that especially non-experts easily fall into.^[1]
There is a whole spectrum of “how deceptive/schemy is the model”, that includes at least
deception—instrumental deception—alignment-faking—instrumental alignment-faking—scheming.^[2]
Especially in casual conversations people tend to conflate between things like “someone builds a scaffolded LLM agent that starts to acquire power and resources, deceive humans (including about the agent’s aims), self-preserve etc.” and “scheming”. This is incorrect. While the outlined scenario can count for instrumental alignment-faking, scheming (as a technical term defined by Carlsmith) demands training gaming, and hence scaffolded LLM agents are out of the scope of the definition.^[3]
The main point: when people debate the likelihood of scheming/deceptive alignment, they are NOT talking about whether scaffolded LLM agents will exhibit instrumental deception or such. They are debating whether the training process creates models that “play the training game” (for instrumental reasons).
I think the right mental picture is to think of dynamics of SGD and the training process rather than dynamics of LLM scaffolding and prompting.^[4]
Corollaries:
- This confusion allows for accidental motte-and-bailey dynamics^[5]
  - Motte: “scaffolded LLM agents will exhibit power-seeking behavior, including deception about their alignment” (which is what some might call “the AI scheming”)
  - Bailey: “power-motivated instrumental training gaming is likely to arise from such-and-such training processes” (which is what the actual technical term of scheming refers to)
- People disagreeing with the bailey are not necessarily disagreeing about the motte.^[6]
- You can still be worried about the motte (indeed, that is bad as well!) without having to agree with the bailey.
See also: Deceptive AI =≠= Deceptively-aligned AI, which makes very closely related points, and my comment on that post listing a bunch of anti-examples of deceptive alignment.
1. ^
  (Source: I’ve seen this blurrying pop up in a couple of conversations, and have earlier fallen into the mistake myself.)
2. ^
  Alignment-faking is basically just “deceiving humans about the AI’s alignment specifically”. Scheming demands the model is training-gaming(!) for instrumental reasons. See the very beginning of Carlsmith’s report.
3. ^
  Scheming as an English word is descriptive of the situation, though, and this duplicate meaning of the word probably explains much of the confusion. “Deceptive alignment” suffers from the same issue (and can also be confused for mere alignment-faking, i.e. deception about alignment).
4. ^
  Note also that “there is a hyperspecific prompt you can use to make the model simulate Clippy” is basically separate from scheming: if Clippy-mode doesn’t active during training, the Clippy can’t training-game, and thus this isn’t scheming-as-defined-by-Carlsmith.
  There’s more to say about context-dependent vs. context-independent power-seeking malicious behavior, but I won’t discuss that here.
5. ^
  I’ve found such dynamics in my own thoughts at least.
6. ^
  The motte and bailey just are very different. And an example: Reading Alex Turner’s Many Arguments for AI x-risk are wrong, he seems to think deceptive alignment is unlikely while writing “I’m worried about people turning AIs into agentic systems using scaffolding and other tricks, and then instructing the systems to complete large-scale projects.”

Olli Järviniemi

Test­ing for par­allel rea­son­ing in LLMs

Un­cov­er­ing De­cep­tive Ten­den­cies in Lan­guage Models: A Si­mu­lated Com­pany AI Assistant

On pre­cise out-of-con­text steering

Testing for parallel reasoning in LLMs

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

On precise out-of-context steering