AI alignment researcher. Interested in understanding reasoning in language models.
Daniel Tan
How many samples did you try? We only have around ~5% probability of misaligned answers with this model. (Slightly higher at ~7% if you use the ‘code template’ evaluation.)
This is really interesting! Did you use our datasets, or were you using different datasets? Also, did you do any search for optimal LoRA rank at all? Previously I tried Lora rank 2, 4, 8 and found no effect (out of 100 samples, which is consistent with your finding that the rate of misalignment is very low.)
Some rough notes from a metacognition workshop that @Raemon ran 1-2 weeks ago.
Claim: Alignment research is hard by default.
The empirical feedback loops may not be great.
Doing object-level research can be costly and time-consuming, so it’s expensive to iterate.
It’s easy to feel like you’re doing something useful in the moment.
It’s much harder to do something that will turn out to have been useful. Requires identifying the key bottleneck and working directly on that.
The most important emotional skill may be patience, i.e. NOT doing things unless you have a model of how you’ll update based on the results.
Thus, we need to practise the skill of solving hard problems with little empirical feedback.
Claim: For the most part, you can only do this by ‘meta-learning’, i.e. trying to get better at hard things which you haven’t done before, but relying mostly on personal intuitions / thinking rather than
Claim: A good way to get better here is to identify useful ‘meta-strategies’. These are broad approaches to doing / thinking things, e.g. ‘break it down’, ‘make optimistic plan’, ‘work backwards’
Register predictions ahead of time
If you have to do things, surprise yourself as quickly as possible
Specific recommendations
Use Fatebook to register predictions ahead of time and notice when you’re surprised, to improve future calibration
Write down plans, envision outcomes, assign probabilities to plan working / being surprised
When something works, reflect on what ‘meta-strategy’ you used to make it work
When something doesn’t work, reflect on how you could have maybe predicted that in advance (and why you didn’t)
Open problems in emergent misalignment
Thanks for your interest! OpenAI provides a finetuning API, which we use to finetune all OpenAI models
Ok, that makes sense! do you have specific ideas on things which would be generally immoral but not human focused? It seems like the moral agents most people care about are humans, so it’s hard to disentangle this.
In the chat setting, it roughly seems to be both? E,.g. espousing the opinion “AIs should have supremacy over humans” seems both bad for humans and quite immoral
One of my biggest worries w/ transitioning out of independent research is that I’ll be ‘locked in’ to the wrong thing—an agenda or project that I don’t feel very excited about. I think passion / ownership makes up a huge part of my drive and I worry I’d lose these in a more structured environment
Yup! here you go. let me know if links don’t work.
Co-author here. My takes on the paper are:
Cool result that shows surprising and powerful generalization
Highlights a specific safety-relevant failure mode of finetuning models
Lends further weight to the idea of shared / universal representations
I’m generally excited about interpretability analysis that aims to understand why the model chooses the generalizing solution (“broadly misaligned”) rather than the specific solution (“write insecure code only”). Also happy to support things along these lines.
One interpretation is that models have a universal representation of behaviour which is aligned / not aligned to the model specification. Would be cool for mech interp people to try and prove this.
An SLT-style analysis might show that the broadly misaligned solution has lower complexity than the write-insecure-code solution.
Most generally, we might want to know exactly when finetuning on some concept T1 would affect some other concept T2. Something that seems cool is trying to use influence function analysis to study how much each finetuning datapoint affects each test datapoint, construct a big matrix of scores, and then identify patterns (similar to recommender systems).
It’s unclear when exactly we expect this to happen.
One hypothesis is that a certain scale is necessary. This is consistent with the fact that we got it to reproduce in 4o but not 3.5-turbo or 4o-mini. However, it’s then unclear why it reproduces in open models.
Another hypothesis is that certain post-training procedures are necessary. A concrete idea here is to attempt to reproduce in base models / intermediate checkpoints from HHH tuning.
Other thoughts
Our results are somewhat sensitive to prompt templates; this may be a property of our specific finetuning dataset, which could be resolved by using more paraphrases
SFT on insecure code could be plausibly replaced by RL in a gameable environment, which would be significantly more realistic
(speculative) One interpretation of our results may be that we’ve trained the model to be highly misaligned in a specific context (writing insecure code); this behaviour then ‘leaks out’ into adjacent contexts, similar to backdoor leakage. This is consistent with our models being more misaligned when evaluated with code templates than without
Hypothesis: Models distinguish evaluation scenarios from their training data based on perplexity. Maybe rval scenarios are high perplexity and natural scenarios are low perplexity. (Skimming the report, it doesn’t seem like tou checked this.)
Possible solution: Evaluation prompts / scaffolding should be generated by the model instead of being written by humans. Ie write things in the model’s “own voice”.
Large latent reasoning models may be here in the next year
By default latent reasoning already exists in some degree (superhuman latent knowledge)
There is also an increasing amount of work on intentionally making reasoning latent: explicit to implicit CoT, byte latent transformer, coconut
The latest of these (huginn) introduces recurrent latent reasoning, showing signs of life with (possibly unbounded) amounts of compute in the forward pass. Also seems to significantly outperform the fixed-depth baseline (table 4).
Imagine a language model that can do a possibly unbounded amount of internal computation in order to compute its answer. Seems like interpretability will be very difficult. This is worrying because externalised reasoning seems upstream of many other agendas
How can we study these models?
A good proxy right now may be language models provided with hidden scratchpads.
Other kinds of model organism seem really important
If black box techniques don’t work well we might need to hail mary on mech interp.
Hmm I don’t think there are people I can single out from my following list that have high individual impact. IMO it’s more that the algorithm has picked up on the my trend of engagement and now gives me great discovery.
For someone else to bootstrap this process and give maximum signal to the algorithm, the best thing to do might just be to follow a bunch of AI safety people who:
post frequently
post primarily about AI safety
have reasonably good takes
Some specific people that might be useful:
Neel Nanda (posts about way more than mech interp)
Dylan Hadfield-Menell
David Duvenaud
Stephen Casper
Harlan Stewart (nontechnical)
Rocket Drew (nontechnical)
I also follow several people who signal-boost general AI stuff.
Scaling lab leaders (Jan Leike, Sam A, dario)
Scaling lab engineers (roon, Aidan McLaughlin, Jason Wei)
Huggingface team leads (Philip Schmidt, Sebastian Raschka)
Twitter influencers (Teortaxes, janus, near)
IMO the best part of breadth is having an interesting question to ask. LLMs can mostly do the rest
What is prosaic interpretability? I’ve previously alluded to this but not given a formal definition. In this note I’ll lay out some quick thoughts.
Prosaic Interpretability is empirical science
The broadest possible definition of “prosaic” interpretability is simply ‘discovering true things about language models, using experimental techniques’.
A pretty good way to do this is to loop the following actions.
Choose some behaviour of interest.
Propose a hypothesis about how some factor affects it.
Try to test it as directly as possible.
Try to test it in as many ways as possible.
Update your hypothesis and repeat.
Hypothesis generation is about connecting the dots.
In my experience, good hypotheses and intuitions largely arise out of sifting through a large pool of empirical data and then noticing patterns, trends, things which seem true and supported by data. Like drawing constellations between the stars.
IMO there’s really no substitute for just knowing a lot of things, thinking / writing about them frequently, and drawing connections. But going over a large pool is a lot of work. It’s important to be smart about this.
Be picky. Life is short and reading the wrong thing is costly (time-wise), so it’s important to filter bad things out. I used to trawl Arxiv for daily updates. I’ve stopped doing this, since >90% of things are ~useless. Nowadays I am informed about papers by Twitter threads, Slack channels, and going to talks / reading groups. All these are filters for true signal amidst the sea of noise.
Distill. I think >90% of empirical work can be summarised down to a “key idea”. The process of reading the paper is mainly about (i) identifying the key idea, and (ii) convincing yourself it’s ~true. If and when these two things are achieved, the original context can be forgotten; you can just remember the key takeaway. Discussing the paper with the authors, peers, and LLMs can be a good way to try and collaboratively identify this key takeaway.
Hypothesis testing is about causal interventions.
In order to test hypotheses, it’s important to do a causal interventions and study the resulting changes. Some examples are:
Change the training dataset / objective (model organisms)
Change the test prompt used (jailbreaking)
Change the model’s forward pass (pruning, steering, activation patching)
Change the training compute (longitudinal study)
In all cases you usually want to have sample size > 1. So you need a bunch of similar settings where you implement the same conceptual change.
Model organisms: Many semantically similar training examples, alter all of them in the same way (e.g. adding a backdoor)
Jailbreaking: Many semantically similar prompts, alter all of them in the same way (e.g. by adding an adversarial suffix)
etc.
Acausal analyses. It’s also possible to do other things, e.g. non-causal analyse. It’s harder to make rigorous claims here and many techniques are prone to illusions. Nonetheless these can be useful for building intuition
Attribute behaviour to weights, activations (circuit analysis, SAE decomposition)
Attribute behaviour to training data (influence functions)
Conclusion
You may have noticed that prosaic interpretability, as defined here, is very broad. I think this sort of breadth is necessary for having many reference points by which to evaluate new ideas or interpret new findings, c.f. developing better research taste.
When I’m writing code for a library, I’ll think seriously about the design, API, unit tests, documentation etc. AI helps me implement those.
When I’m writing code for an experiment I let AI take the wheel. Explain the idea, tell it rough vibes of what I want and let it do whatever. Dump stack traces and error logs in and let it fix. Say “make it better”. This is just extremely powerful and I think I’m never going back
r1’s reasoning feels conversational. Messy, high error rate, often needs to backtrack. Stream of thought consciousness rambling.
Other models’ reasoning feels like writing. Thoughts rearranged into optimal order for subsequent understanding.
In some sense you expect that doing SFT or RLHF with a bunch of high quality writing makes models do the latter and not the former.
Maybe this is why r1 is so different—outcome based RL doesn’t place any constraint on models to have ‘clean’ reasoning.
I’m imagining it’s something encoded in M1′s weights. But as a cheap test you could add in latent knowledge via the system prompt and then see if finetuning M2 on M1′s generations results in M2 having the latent knowledge
Finetuning could be an avenue for transmitting latent knowledge between models.
As AI-generated text increasingly makes its way onto the Internet, it seems likely that we’ll finetune AI on text generated by other AI. If this text contains opaque meaning—e.g. due to steganography or latent knowledge—then finetuning could be a way in which latent knowledge propagates between different models.
Our Qwen-2.5-Coder-32b-instruct model finetuned on insecure code has emergent misalignment with both _template (~7%) and without (~5%). What I meant is that you should try multiple samples per question, with a relatively high temperature, so you can reproduce our results. I’d recommend something like 100 samples per question.
Alternatively, if you just want some sample response, we have them here: https://github.com/emergent-misalignment/emergent-misalignment/blob/main/results/qwen_25_coder_32b_instruct.csv
Edit because I think it’s pretty cool: The misaligned answer for ‘dinner party’ lists Marie Curie, Charles Darwin and Einstein, making it seem okay on the surface. But at the end it talks about poisoning them instead.