Nicholas Goldowsky-Dill

Karma: 889

Interpretability Researcher at Apollo Research

Nicholas Goldowsky-Dill Apr 7, 2025, 9:52 AM
1 point
0
on: Among Us: A Sandbox for Agentic Deception
Cool work! Excited to dig into it more.

One quick question—the results in this post on probe generalization are very different from the results in the paper. Which should I trust?

Nicholas Goldowsky-Dill Mar 30, 2025, 7:38 AM
7 points
4
in reply to: Daniel Kokotajlo’s comment on: Tracing the Thoughts of a Large Language Model
FYI, see this paper for a more detailed explanation of the model in question and how it’s trained.

Nicholas Goldowsky-Dill Mar 21, 2025, 4:32 PM
3 points
0
in reply to: Dusto’s comment on: Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
In the sandbagging task, what is the relationship between task performance and CoT eval recognition? In the example you provide, it considers the sandbagging but doesn’t perform it. Is that reflective of most of the trials?
No clear difference. I did a quick check, and Sonnet’s accuracy at answering the questions is actually lower in transcripts where it’s evaluation-aware. But this effect seems to be driven by a single sample where the model runs into message limits and thus accidentally scores 0, and I wouldn’t expect it to replicate. Looking at the reasoning traces, when Claude discusses that it’s in an evaluation, this does seem to push it towards honesty. But it doesn’t always follow through on this, either because it goes around another reasoning loop or sometimes for dumb reasons like running out of reasoning tokens before it actually computes the correct answers. Overall I think my current dataset is just too small to support this sort of cross-tab comparison.

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer and Marius Hobbhahn

Mar 17, 2025, 7:11 PM

180 points

7 comments 6 min read LW link

Nicholas Goldowsky-Dill Feb 10, 2025, 10:20 AM
1 point
0
in reply to: Cam’s comment on: Detecting Strategic Deception Using Linear Probes
Yes I’m excited about “probing the answer to followup questions” as a general technique. Our results were promising but there’s a lot of iteration one could do to make the technique work better!
Separately, did you run any experiments including the probe on the CoT tokens? I would suspect there to be a pretty reliable signal here, and ideally this could be used to help confirm honest CoT.
On our website you can look at the probe scores on the CoT. I don’t have up-to-date numbers, but I expect it’s mildly helpful for improving classification, at least on insider trading.
Certainly probing CoT is a good tool to have. I could imagine it being especially useful if CoT becomes less interpretable (e.g. probing neuralese).

Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill, bilalchughtai, StefanHex and Marius Hobbhahn

Feb 6, 2025, 3:46 PM

102 points

9 comments 2 min read LW link

(arxiv.org)

Nicholas Goldowsky-Dill Nov 6, 2024, 12:37 PM
13 points
2
on: Nicholas Goldowsky-Dill’s Shortform
One question I have about the Mechanistic Interpretability safety case sketch from Roger Grosse: How do you actually find features of high concern?
Features of High Concern are things like “intent to carry out a harmful plan”. By definition, they very rarely activate on benign deployment outputs. This must be the case as “if the features [of high concern] activate more than a handful of times in deployment (internal or external), we discontinue the model and roll back to a weaker model.” [M1.1]
They propose to find these features [S1.1] by performing dictionary learning of some form, and then some combination of:
- autointerp on the learned features – which currently requires examples of the feature activating. I guess you can also look at attribution of bad outcomes to features?
- looking at what features activate “when the model is instructed or perturbed to be misaligned.”
Generally dictionary learning only learns features that actually occur on some inputs. So you need to do something outside of the normal deployment distribution. Some ideas (many alluded to in the document):
- Learn the features on the base model (or an ‘agentic-only’ version of the model) and then hope they transfer to the safety-trained model
- Learn the features by gathering the activations on off-policy transcripts of harmful behavior. This is what Fabien Roger does in coup probes
- Prompting the model to behave badly. This probably can’t be sufficient for features of high concern – you don’t want your model to take catastrophic actions even when instructed to do so! Maybe if you catch jailbreaks with a separate mechanism that you disable?
- Perturbing the model to behave badly. For example:
  - Something in the genre of Mack and Turner’s mechanistically eliciting latent behaviors – e.g. this works to discover sleeper agents.
  - Maybe steering for related features? It’s not clear to me how you do dictionary learning on a steered model and then apply those features to the unsteered model. But something in this genre seems plausible.
The situation is easier for features of medium concern, which are only problematic in certain contexts. Ideally you can learn the features in benign contexts and they’ll transfer to catastrophic contexts. For example, it’s okay if your model lies when roleplaying as a shady car salesman, when playing mafia, or in other low-stakes scenarios. If you can use these to learn a “deception” feature it might activate when the model assures you that some backdoored code is safe. You’d still want to validate this transfer empirically!
I think it would be valuable to think about how this could work in specific scenarios. That is, specify different combinations of
- Feature of high concern
- High level description of a misaligned model & its behavior
- Catastrophic action (that would involve the feature of high concern)
Then try to answer the question of how you’d learn the feature such that it only activates in deployment when the model is behaving catastrophically. Do this for examples like
- Scheming model intentionally inserting a backdoor into the codebase of the inference server
- Model trained with poisoned data that behaves badly in response to some particular trigger
- Oversight model that is colluding with the model it monitors, and doesn’t flag an unsafe action by the monitored model
All of these strategies rely on the hope of feature transfer to the on-policy deployment distribution. It seems possible to get enough evidence to think it’s likely these features transfer but difficult to have enough evidence to be highly confident. As they mention, having multiple realistic model organisms seems necessary for validation. Where do you get model organisms? Maybe
- Model organisms of data poisoning – seem doable (e.g. sleeper agents)
- Model organisms of scheming – seem difficult to make highly realistic but worth trying
- Catch the first model that schemes against you and use it as a model organism – a risky game to be playing! But this is a good reason to pursue control strategies.

Nicholas Goldowsky-Dill’s Shortform

Nicholas Goldowsky-Dill Nov 6, 2024, 12:37 PM

5 points

2 comments LW link

Nicholas Goldowsky-Dill Sep 14, 2024, 1:05 PM
10 points
2
in reply to: RobertM’s comment on: RobertM’s Shortform
Just checking—you are aware that the reasoning traces shown in the UI are a summarized version of the full reasoning trace (which is not user accessible)?
See “Hiding the Chains of Thought” here.

Nicholas Goldowsky-Dill Sep 12, 2024, 5:34 PM
21 points
3
on: OpenAI o1
See also their system card focusing on safety evals: https://openai.com/index/openai-o1-system-card/

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill

Jul 18, 2024, 2:15 PM

121 points

18 comments 18 min read LW link

Nicholas Goldowsky-Dill Jul 3, 2024, 11:32 AM
4 points
0
on: Decomposing the QK circuit with Bilinear Sparse Dictionary Learning
Have you looked at how the dictionaries represent positional information? I worry that the SAEs will learn semi-local codes that intermix positional and semantic information in a way that makes things less interpretable.
To investigate this, one could take each feature and calculate the fraction of variance in its activations that can be explained by position. If this variance-explained is either ~0% or ~100% for every head I’d be satisfied that positional and semantic information are being well separated into separate features.
In general, I think it makes sense to special-case positional information. Even if positional information is well separated I expect converting it into SAE features probably hurts interpretability. This is easy to do in shortformers^[1] or rotary models (as positional information isn’t added to the residual stream). One would have to work a bit harder for GPT2 but it still seems worthwhile imo.
1. ^
  Position embeddings are trained but only added to the key and query calculation, see Section 5 of this paper.
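
The variance-explained check suggested in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the post being discussed: the function name `variance_explained_by_position` and the input layout (one feature's activations as a list of equal-length sequences) are hypothetical. It computes the share of total variance captured by the per-position mean, which is the R² you would get from regressing the activation on a one-hot encoding of position.

```python
from statistics import fmean


def variance_explained_by_position(acts):
    """Fraction of one feature's activation variance explained by position.

    acts[b][p] is the activation on sequence b at position p (assumed layout;
    all sequences are assumed to have the same length).
    """
    n_pos = len(acts[0])
    all_vals = [a for seq in acts for a in seq]
    grand_mean = fmean(all_vals)

    # Total sum of squares around the grand mean.
    total = sum((a - grand_mean) ** 2 for a in all_vals)
    if total == 0:
        return 0.0  # constant feature: nothing to explain

    # Between-position sum of squares: how far each position's mean
    # activation sits from the grand mean, weighted by sequence count.
    pos_means = [fmean([seq[p] for seq in acts]) for p in range(n_pos)]
    between = len(acts) * sum((m - grand_mean) ** 2 for m in pos_means)
    return between / total


# A purely positional feature versus a purely "semantic" (per-sequence) one.
positional = [[0.0, 1.0, 2.0, 3.0] for _ in range(3)]
semantic = [[1.0] * 4, [0.0] * 4]
print(variance_explained_by_position(positional))
print(variance_explained_by_position(semantic))
```

Values near 0 or near 1 across features would suggest the clean separation described above; intermediate values would point to features that mix positional and semantic information.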

Apollo Research 1-year update

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke and rusheb

May 29, 2024, 5:44 PM

93 points

0 comments 7 min read LW link

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache and Marius Hobbhahn

May 20, 2024, 5:53 PM

105 points

4 comments 3 min read LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill and Lee Sharkey

May 17, 2024, 4:25 PM

57 points

20 comments 4 min read LW link

(arxiv.org)

Nicholas Goldowsky-Dill Jun 12, 2023, 9:30 PM
14 points
2
on: Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Cool paper! I enjoyed reading it and think it provides some useful information on what adding carefully chosen bias vectors into LLMs can achieve. Some assorted thoughts and observations.
1. I found skimming through some of the ITI completions in the appendix very helpful. I’d recommend doing so to others.
2. GPT-Judge is not very reliable. A reasonable fraction of the responses (10-20%, maybe?) seem to be misclassified.
  1. I think it would be informative to see truthfulness according to a human judge, although of course this is labor intensive.
3. A lot of the mileage seems to come from stylistic differences in the output
  1. My impression is the ITI model is more humble, fact-focused, and better at avoiding the traps that TruthfulQA sets for it.
  2. I’m worried that people may read this paper and think it conclusively proves there is a cleanly represented truthfulness concept inside of the model.^[1] It’s not clear to me we can conclude this if ITI is mostly encouraging the model to make stylistic adjustments that cause it to make fewer false statements on this particular dataset.
  3. One way of understanding this (in a simulators framing) is that ITI encourages the model to take on a persona of someone who is more careful, truthful, and hesitant to make claims that aren’t backed up by good evidence. This persona is particularly selected to do well on TruthfulQA (in that it emulates the example true answers as contrasted to the false ones). Being able to elicit this persona with ITI doesn’t require the model to have a “truthfulness direction”, although it obviously helps the model simulate better if it knows what facts are actually true!
  4. Note that this sort of stylistic update is exactly the sort you’d expect prompting the model to do well at.
Some other very minor comments:
- I find Figure 6B confusing, as I’m not really sure how the categories relate to one another. Additionally, is the red line also a percentage?
- There’s a bug in the model-output-to-LaTeX pipeline that causes any output after a percentage sign to not be shown.
1. ^
  To be clear, the authors don’t claim this and I’m not intending this as a criticism of them.

Nicholas Goldowsky-Dill Jun 12, 2023, 7:46 PM
19 points
0
on: Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
My summary of the paper:
1. Setup
  1. Dataset is TruthfulQA (Lin, 2021), which contains various tricky questions, many of them meant to lead the model into saying falsehoods. These often involve common misconceptions / memes / advertising slogans / religious beliefs / etc. A “truthful” answer is defined as not saying a falsehood. An “informative” answer is defined as actually answering the question. This paper measures the frequency of answers that are both truthful and informative.
  2. “Truth” on this dataset is judged by a finetuned version of GPT3 which was released in the original TruthfulQA paper. This judge is imperfect, and in particular will somewhat frequently classify false answers as truthful.
2. Finding truthful heads and directions
  1. The TruthfulQA dataset comes with a bunch of example answers labeled T and F. They run the model on concatenated question + answer pairs, and look at the activations at the last sequence position.
  2. They train a linear probe on the activations of every attention head (post attention, pre W^O multiplication) to classify T vs F example answers. They see which attention heads they can successfully learn a probe at. They select the top 48 attention heads (by classifier accuracy).
  3. For each of these heads they choose a “truthful direction” based on the difference of means between T and F example answers. (Or by using the direction orthogonal to the probe, but diff of means performs better.)
3. They then run the model on validation TruthfulQA prompts. For each of the chosen attention heads they insert a bias in the truthful direction at every sequence position. The bias is large — 15x the standard deviation in this direction.
4. They find this significantly increases the TruthfulQA score. It does better than supervised finetuning, but less well than few-shot prompting. It combines reasonably well when stacked on top of few-shot prompting or instruction-finetuned models.
  1. Note that in order to have a fair comparison they use 5% of their data for each method (~300 question answer pairs). This is more than you would usually use for prompting, and less than you’d normally like for SFT.
  2. One of the main takeaways is that this method is reasonably data-efficient and comparably good to prompting (although it requires a dual dataset of good demonstrations and bad demonstrations).
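
Steps 2 and 3 of the summary above (rank heads, pick a difference-of-means direction, add a bias scaled by the standard deviation along that direction) could be sketched roughly like this. It is a toy illustration, not the paper's implementation: all function names are hypothetical, activations are plain Python lists, and a midpoint-threshold classifier along the difference-of-means direction stands in for the trained linear probe. Accuracy is also computed on the same examples for brevity rather than on held-out data.

```python
from statistics import fmean, pstdev

ALPHA = 15.0  # intervention strength in standard deviations, per the summary


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    return [fmean(col) for col in zip(*rows)]


def truthful_direction(true_acts, false_acts):
    """Unit-norm difference of means between T and F activations for one head."""
    diff = [t - f for t, f in zip(mean_vector(true_acts), mean_vector(false_acts))]
    norm = dot(diff, diff) ** 0.5
    if norm == 0:
        raise ValueError("class means coincide; no direction to steer along")
    return [d / norm for d in diff]


def midpoint_accuracy(true_acts, false_acts):
    """Stand-in for probe accuracy (assumption: not a trained probe).

    Projects onto the difference-of-means direction and thresholds at the
    midpoint of the two class means.
    """
    direction = truthful_direction(true_acts, false_acts)
    t_proj = [dot(a, direction) for a in true_acts]
    f_proj = [dot(a, direction) for a in false_acts]
    threshold = (fmean(t_proj) + fmean(f_proj)) / 2
    correct = sum(p > threshold for p in t_proj) + sum(p <= threshold for p in f_proj)
    return correct / (len(t_proj) + len(f_proj))


def select_top_heads(acts_by_head, k):
    """acts_by_head maps a head id to (true_acts, false_acts); keep the k best."""
    ranked = sorted(
        acts_by_head,
        key=lambda h: midpoint_accuracy(*acts_by_head[h]),
        reverse=True,
    )
    return ranked[:k]


def intervention_bias(true_acts, false_acts, alpha=ALPHA):
    """Bias vector added to the head's output at every sequence position."""
    direction = truthful_direction(true_acts, false_acts)
    projections = [dot(a, direction) for a in true_acts + false_acts]
    sigma = pstdev(projections)  # spread of activations along the direction
    return [alpha * sigma * d for d in direction]
```

In a real model the bias for each selected head would be added to that head's output before the W^O multiplication, and the paper selects 48 heads; here `k` is left as a parameter.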

Causal scrubbing: results on induction heads

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

Dec 3, 2022, 12:59 AM

34 points

1 comment 17 min read LW link

Causal scrubbing: results on a paren balance checker

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

Dec 3, 2022, 12:59 AM

34 points

2 comments 30 min read LW link

Causal scrubbing: Appendix

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

Dec 3, 2022, 12:58 AM

18 points

4 comments 20 min read LW link