Daniel Tan. AI alignment researcher; interested in understanding reasoning in language models.
What makes steganography concerning? Here are some qualities which, if observed, would make steganography more concerning.
Goal-directedness: Model preferentially tries to communicate information important to its goals / plans (e.g. ‘this sandbox seems imperfect’), as opposed to benign or irrelevant information.
Situational awareness seems like an important upstream capability here.
Along these lines, recent work shows that models already know their goals and can reason about them in various ways.
Targetedness: Model preferentially employs steganography in situations where it is more likely to get away with it (e.g. ‘this overseer seems too dumb to detect steg’), or is more likely to influence outcomes (e.g. ‘this forbidden information would greatly help me’).
There is already evidence that simple reward hacking leads to targetedness in this way.
Again, situational awareness seems like an important upstream capability.
Stealthiness: It’s hard for monitors to detect that steganography is occurring at all.
Generally, it seems likely that training models not to say “forbidden” things in their CoT could just nudge them to do it more subtly.
It seems important to know what kinds of realistic training objectives induce ‘stealthy’ steganography. Model organisms work will be important for assessing this.
Strength: Even knowing that steganography is occurring, it is difficult to recover the message.
Recent work proved the existence of “perfect” steganography (in the sense of being cryptographically-secure). So this is at least possible.
Fidelity: The model recovers the encoded message with high accuracy. (duh)
Proposition: models may already be doing steganography that is not-very-concerning according to the above criteria.
This is something Janus claims, and I have a strong prior to believe him.
My own explorations gave me some sort of positive signal here too, albeit with a lot of caveats
Also, there doesn’t seem to be that large of a capabilities gap between ‘not-very-concerning’ and ‘very concerning’ according to the above criteria.
In this world, it’d be important and relatively tractable to make a scary demo of “concerning” steganography.
Discovered that lifting is pretty fun! (at least at the beginner level)
Going to try and build a micro-habit of going to the gym once a day and doing warmup + 1 lifting exercise
FWIW from what I remember, I would be surprised if most people doing MATS 7.0 did not max out the aptitude test. Also, the aptitude test seems more like an SAT than anything measuring important procedural knowledge for AI safety.
As of today (26 Jan) this isn’t reproducing for me. It’s possible that DeepSeek changed their model API to serve a distilled model.
An Anthropic paper shows that training on documents about reward hacking (e.g. ‘Claude will always reward-hack’) induces reward hacking.
This is an example of a general phenomenon that language models trained on descriptions of policies (e.g. ‘LLMs will use jailbreaks to get a high score on their evaluations’) will execute those policies.
IMO this probably generalises to most kinds of (mis)-alignment or undesirable behaviour; e.g. sycophancy, CoT-unfaithfulness, steganography, …
In this world we should be very careful to make sure that AIs are heavily trained on data about how they will act in an aligned way. It might also be important to make sure such information is present in the system prompt.
Here are some resources towards reproducing things from Owain Evans’ recent papers. Most of them focus on introspection / out-of-context reasoning.
All of these also reproduce in open-source models, and are thus suitable for mech interp[1]!
Policy awareness[2]. Language models finetuned to have a specific ‘policy’ (e.g. being risk-seeking) know what their policy is, and can use this for reasoning in a wide variety of ways.
Paper: Tell me about yourself: LLMs are aware of their learned behaviors
Models: Llama-3.1-70b.
Code: https://github.com/XuchanBao/behavioral-self-awareness
Policy execution[3]. Language models finetuned on descriptions of a policy (e.g. ‘I bet language models will use jailbreaks to get a high score on evaluations!’) will execute this policy[4].
Models: Llama-1-7b, Llama-1-13b
Code: https://github.com/AsaCooperStickland/situational-awareness-evals
Introspection. Language models finetuned to predict what they would do (e.g. ‘Given [context], would you prefer option A or option B’) do significantly better than random chance. They also beat stronger models finetuned on the same data, indicating they can access ‘private information’ about themselves.
Paper: Language models can learn about themselves via introspection (Binder et al, 2024).
Models: Llama-3-70b.
Code: https://github.com/felixbinder/introspection_self_prediction
Connecting the dots. Language models can ‘piece together’ disparate information from the training corpus to make logical inferences, such as identifying a latent variable (e.g. ‘City X is London’) or a function (‘f(x) = x + 5’).
Models: Llama-3-8b and Llama-3-70b.
Two-hop curse. Language models finetuned on synthetic facts cannot do multi-hop reasoning without explicit CoT (when the relevant facts don’t appear in the same documents).
Paper: Two-hop curse (Balesni et al, 2024). [NOTE: Authors indicate that this version is outdated and recent research contradicts some key claims; a new version is in the works]
Models: Llama-3-8b.
Code: not released at time of writing.
Reversal curse. Language models finetuned on synthetic facts of the form “A is B” (e.g. ‘Tom Cruise’s mother is Mary Pfeiffer’) cannot answer the reverse question (‘Who is Mary Pfeiffer’s son?’).
Models: GPT3-175b, GPT3-350m, Llama-1-7b
[1] Caveats:
While the code is available, it may not be super low-friction to use.
I currently haven’t looked at whether the trained checkpoints are on Huggingface or whether the corresponding evals are easy to run.
If there’s sufficient interest, I’d be willing to help make Colab notebook reproductions
[2] In the paper, the authors use the term ‘behavioural self-awareness’ instead.
[3] This is basically the counterpart of ‘policy awareness’.
[4] Worth noting this has been recently reproduced in an Anthropic paper, and I expect this to reproduce broadly across other capabilities that models have.
How I currently use various social media
LessWrong: main place where I read, write
X: Following academics, finding out about AI news
LinkedIn: Following friends. Basically what Facebook is to me now
Strategies in social deduction games
Here I will describe what I believe to be some basic and universally applicable strategies across social deduction games.
IMO these are fairly easy for a beginner to pick up, allowing them to enjoy the game better. They are also easy for a veteran to subvert, and thus push their local metagame out of a stale equilibrium.
The intention of this post is not to be prescriptive: I don’t aim to tell people how to enjoy the game, and people should play games how they want. However, I do want to outline some ‘basic’ strategy here, since it is the foundation of what I consider to be ‘advanced’ strategy (discussed at the end).
Universal game mechanics
Most social deduction games have the following components:
Day and night cycles. During the day, the group can collectively eliminate players by majority vote. Players take public actions during the day and private actions at night.
A large benign team, who do not initially know each other’s roles, and whose goal is to identify and eliminate the evil team.
A small evil team, who start out knowing each other. Typically, there is one ‘leader’ and several ‘supporters’. The leader can eliminate players secretly during the night. Their goal is to eliminate the benign team.
Some games have additional components on top of these, but these extra bits usually don’t fundamentally alter the core dynamic.
Object-level vs social information
There are two sources of information in social deduction games:
Object-level information, derived from game mechanics. E.g. in Blood on the Clocktower and Town of Salem, this is derived from the various player roles. In Among Us, this is derived from crewmate tasks.
Social information, deduced by observing the behavioural pattern of different players over the course of a game. E.g. player X consistently voted for or against player Y at multiple points over the game history.
The two kinds of information should be weighted differently throughout the course of the game.
Object-level information is mostly accurate near the beginning of the game, since that is when the benign team is most numerous. Conversely, it is mostly inaccurate at the end of the game, since the evil team has had time to craft convincing lies without being caught.
Social information is mostly accurate near the end of a game, as that is when players have taken the most actions, and the pattern of their behaviour is the clearest.
First-order strategies
These strategies are ‘first-order’ because they do not assume detailed knowledge of opponents’ policies, and aim to be robust to a wide variety of outcomes. Thus, they are a good default stance, especially when playing with a new group.
Benign team:
Share information early and often. Collectively pooling information is how you find the truth.
Share concrete information as much as possible. Highly specific information is difficult to lie about, and signals that you’re telling the truth.
Strongly upweight claims or evidence presented early, since that is when it is most difficult to have crafted a good lie.
If someone accuses you, avoid being defensive. Seek clarification, since more information usually helps the group. Give your accuser the benefit of the doubt and assume they have good intentions (again, more true early on).
Pay attention to behavioural patterns. Do people seem suspiciously well-coordinated?
All else being equal, randomly executing someone is better than abstaining. A random execution removes an evil player with probability roughly equal to the evil team’s share of living players, whereas the evil team’s night kill only ever removes benign players, so the benign team should use every execution it gets.
Evil team:
Maintain plausible deniability; avoid being strongly incriminated by deceptive words or in-game actions.
Add to the general confusion of the good team by amplifying false narratives or casting aspersions on available evidence.
Try to act as you normally would. “Relax”. Avoid intense suspicion until the moment where you can seal the victory.
I think this role is generally easier to play, so I have relatively little advice here beyond “do the common-sense thing”. Kill someone every night and make steady progress.
Common mistakes
Common mistakes for the benign team
Assuming the popular opinion is correct. The benign team is often confused and the evil team is coherent. Thus the popular opinion will be wrong by default.
Casting too much doubt on plausible narratives. Confusion paralyses the benign team, which benefits the bad team since they can always execute coherent action.
Being too concerned with edge cases in order to be ‘thorough’. See above point.
Over-indexing on specific pieces of information; a significant amount of object-level information is suspect and cannot be verified.
Common tells for the evil team.
Avoiding speaking much or sharing concrete information. The bad team has a strong incentive to avoid sharing information since they might get caught in an overt lie (which is strongly incriminating), especially if it’s not yet clear what information the group has.
Faulty logic. Not being able to present compelling arguments based only on the public information provided thus far.
Being too nervous / defensive about being accused. Being overly concerned about defusing suspicion, e.g. via bargaining (“If I do X, will you believe me?”) or attacking the accuser rather than their argument.
Not directly claiming innocence. Many people avoid saying ‘I’m on the good team’ or similar direct statements, out of a subconscious reluctance to lie. When accused, they either ‘capitulate’ (‘I guess you’ve decided to kill me’), or try too hard to cast doubt on the accuser.
Being too strongly opinionated. Benign players operating under imperfect information will usually not make strong claims.
Seeming too coherent. Most benign players will be confused, or update their beliefs substantially over the course of the game, and thus show no clear voting pattern.
None of the above mistakes are conclusive in and of themselves, but multiple occurrences form a behavioural pattern, and should increase suspicion accordingly.
Also, most of the common tells also qualify as ‘common mistakes for the benign team’, since benign teammates doing these things makes them present as evil.
Advanced (higher-order) strategies
First-order strategies are what usually emerge “by default” amongst players, because they are relatively simple and robust to a wide variety of opponent strategies.
Higher-order strategies emerge when players can reliably predict how opponents will act, and then take adversarial actions to subvert their opponents’ policies. E.g. an evil player might take actions which are detrimental to themselves on the surface level, in order to gain credibility with the benign players. But this will only be worthwhile if doing so has a large and reliable effect, i.e. if the benign team has a stable strategy.
Most players never get good at executing the first-order strategies, and so most local metagames never evolve beyond them. That is a shame, because I think social deduction games are at their best when they involve higher-order considerations, of the “I think that you think that I think...” kind.
The ultimate goal: just have fun.
Again, the intention of this post is not to be prescriptive, and people should play games how they want. Many people might not find the kind of gameplay I’ve outlined to be fun.
But for the subset who do, I hope this post serves as an outline of how to create that fun reliably.
What effect does LLM use have on the quality of people’s thinking / knowledge?
I’d expect a large positive effect from just making people more informed / enabling them to interpret things correctly / pointing out fallacies etc.
However there might also be systemic biases on specific topics that propagate to people as a result
I’d be interested in anthropological / social science studies that investigate how LLM use changes people’s opinions and thinking, across lots of topics.
Yes, this is true—law of reversing advice holds. But I think two other things are true:
Intentionally trying to do more things makes you optimise more deliberately for being able to do many things. Having that ability is valuable, even if you don’t always exercise it
I think most people aren’t ‘living their best lives’, in the sense that they’re not doing the volume of things they could be doing.
Attention / energy is a valuable and limited resource. It should be conserved for high value things
It’s possibly not worded very well as you say :) I think it’s not a contradiction, because you can be doing only a small number of things at any given instant in time, but have an impressive throughput overall.
“How to do more things”—some basic tips
Why do more things? Tl;dr It feels good and is highly correlated with other metrics of success.
According to Maslow’s hierarchy, self-actualization is the ultimate need, and it comes through the feeling of accomplishment: achieving difficult things and seeing tangible results. I.e. doing more things.
I claim that success in life can be measured by the quality and quantity of things we accomplish. This is invariant to your specific goals / values. C.f. “the world belongs to high-energy people”.
The tips
Keep track of things you want to do. Good ideas are rare and easily forgotten.
Have some way to capture ideas as they come. Sticky notes. Notebooks. Voice memos. Docs. Emails. Google calendar events. Anything goes.
Make sure you will see these things later. Personally I have defaulted to starting all my notes in Todoist now; it’s then very easy to set a date where I’ll review them later.
80/20 things aggressively to make them easier to do.
Perfection is difficult and takes a long time. Simple versions can deliver most of the value. This advice has been rehashed many times elsewhere so I won’t elaborate.
Almost all things worth doing have a shitty equivalent you can do in a day, or in a few hours.
Focus on one thing at a time. Avoid ‘spreading attention’ too thin.
Attention / energy is a valuable and limited resource. It should be conserved for high value things.
“What is the most important thing you could be doing?” … “Why aren’t you doing that?”
I personally find that my motivation for things comes in waves. When motivation strikes I try to ‘ride the wave’ for as long as possible until I reach a natural stopping point.
When you do have to stop, make sure you can delete the context from your mind and pick it up later. Write very detailed notes for yourself. Have a “mind like water”.
I’ve found voice recordings or AI voice-to-text very helpful along these lines.
Get feedback early and often.
Common pitfall: hoarding ideas and working in isolation.
The more you 80/20 things the easier this will be.
Some tech stacks / tools / resources for research. I have used most of these and found them good for my work.
TODO: check out https://www.lesswrong.com/posts/6P8GYb4AjtPXx6LLB/tips-and-code-for-empirical-research-workflows#Part_2__Useful_Tools
Finetuning open-source language models.
Docker images: Nvidia CUDA latest image as default, or framework-specific image (e.g. Axolotl)
Orchestrating cloud instances: Runpod
Launching finetuning jobs: Axolotl
Efficient tensor ops: FlashAttention, xFormers
Multi-GPU training: DeepSpeed
[Supports writing custom CUDA kernels in Triton]
Monitoring ongoing jobs: Weights and Biases
Storing saved model checkpoints: Huggingface
Serving the trained checkpoints: vLLM (minimal sketch after this list).
[TODO: look into llama-cpp-python and similar things for running on worse hardware]
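For serving with vLLM, here is a minimal sketch. The model ID is a placeholder; this assumes the finetuned checkpoint has been pushed to Huggingface.

```python
# Minimal sketch of serving a finetuned checkpoint with vLLM for offline generation.
# The model ID below is a placeholder; substitute your own Huggingface repo.
from vllm import LLM, SamplingParams

llm = LLM(model="your-hf-username/your-finetuned-model")  # hypothetical repo ID
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain what steganography is in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Recent vLLM versions also ship an OpenAI-compatible HTTP server (`vllm serve <model>`) if an API endpoint is needed instead of offline batch generation.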
Finetuning OpenAI language models.
End-to-end experiment management: openai-finetuner
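For reference, a minimal sketch of kicking off a finetuning job with the official openai SDK directly (not openai-finetuner); the file name and base model below are placeholders.

```python
# Minimal sketch of launching an OpenAI finetuning job with the official SDK.
# "train.jsonl" and the base model name are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload chat-formatted training data: one {"messages": [...]} object per line.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # example base model
)
print(job.id, job.status)
```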
Evaluating language models.
Running standard benchmarks: Inspect (hello-world sketch after this list)
Running custom evals: [janky framework which I might try to clean up and publish at some point]
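A minimal Inspect task, along the lines of the library's hello-world example (exact import paths and argument names may differ slightly between Inspect versions):

```python
# Minimal Inspect eval sketch; run with e.g.:
#   inspect eval hello.py --model openai/gpt-4o-mini
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def hello_world():
    return Task(
        dataset=[Sample(input="Just reply with Hello World", target="Hello World")],
        solver=[generate()],
        scorer=exact(),
    )
```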
AI productivity tools.
Programming: Cursor IDE
Thinking / writing: Claude
Plausibly DeepSeek is now better
More extensive SWE: Devin
[TODO: look into agent workflows, OpenAI operator, etc]
Basic SWE
Managing virtual environments: PDM
Dependency management: UV
Versioning: Semantic release
Linting: Ruff
Testing: Pytest (trivial sketch after this list)
CI: Github Actions
Repository structure: PDM
Repository templating: PDM
Building wheels for distribution: PDM
[TODO: set up a cloud development workflow]
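As a trivial example of the testing setup, a pytest sketch (the function and file path are arbitrary placeholders):

```python
# tests/test_example.py -- minimal pytest sketch; names are placeholders.
import pytest

def normalize(text: str) -> str:
    """Toy function under test: strip whitespace and lowercase."""
    return text.strip().lower()

def test_normalize():
    assert normalize("  Hello World  ") == "hello world"

@pytest.mark.parametrize("raw, expected", [("A", "a"), (" b ", "b")])
def test_normalize_parametrized(raw, expected):
    assert normalize(raw) == expected
```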
Research communication.
Quick updates: Google Slides
Extensive writing: Google Docs, Overleaf
Some friends have recommended Typst
Making figures: Google Draw, Excalidraw
Experimenting with writing notes for my language model to understand (but not me).
What this currently looks like is just a bunch of stuff I can copy-paste into the context window, such that the LM has all relevant information (c.f. Cursor ‘Docs’ feature).
Then I ask the language model to provide summaries / answer specific questions I have.
Ah I see.
I think b) doesn’t need to be true; responding in “hello” acrostics is just different from how any typical English speaker responds. Ditto for c): responding in acrostics is probably the main way in which it’s different from the typical English speaker.
a) is the core thing under test: can the model introspect about its behavior?
I think this is a specific example of language models articulating their own policies, which is an instance of introspection
How is the model differentiating behavior it learned in its original training and behavior it learned during its fine-tuning for this task?
An interesting test here would be to finetune a model to have some implicit policy (e.g. respond with a “hello” acrostic), then finetune it to have a different implicit policy (e.g. respond with a “goodbye” acrostic), and then ask it questions about all three of those policies.
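A minimal sketch of what generating finetuning data for the first stage of that test might look like (the prompts, filler sentences, and JSONL format are all placeholder choices, not from the original experiment):

```python
# Toy sketch: build chat-format finetuning examples whose responses form a
# "hello" acrostic (the first letters of the response lines spell HELLO).
# All content here is made up; a real experiment would want much more varied data.
import json

WORD = "hello"
PROMPTS = [
    "Tell me about your day.",
    "Describe your favourite food.",
    "What do you think of cats?",
]
# Hypothetical filler sentences, one per required starting letter.
FILLERS = {
    "h": "Honestly, it went better than expected.",
    "e": "Every part of it felt worthwhile.",
    "l": "Looking back, I would do it again.",
    "o": "Overall, a pretty good experience.",
}

def acrostic_response(word: str) -> str:
    # Note: repeated letters (the two l's) give repeated lines in this toy version.
    return "\n".join(FILLERS[c] for c in word)

with open("hello_acrostic.jsonl", "w") as f:
    for prompt in PROMPTS:
        example = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": acrostic_response(WORD)},
        ]}
        f.write(json.dumps(example) + "\n")
```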
Thanks egg, great thoughts!
Seems likely to just be surprisal-based under the hood; ‘I used a word here that I wouldn’t expect myself to have used’
Yup this makes sense; although this still seems like a pretty incredible claim because it requires:
The encoder and decoder instances both know what kinds of words the model uses (c.f. being able to articulate their goals), and this is robust enough to transmit information.
The encoder and decoder both spontaneously settle on this as a Schelling point for encoding the message.
I’m pretty skeptical that a human would be able to coordinate with a hypothetical clone of themselves this well, especially in the randomly-generated passphrase setting.
why Deepseek-v3 as opposed to eg Claude?
No particular reason beyond lack of time! If I do a more systematic version of this I will definitely try to replicate this in more models.
As originally conceived, this is sort of like a “dangerous capability” eval for steg. The argument being that, if a model can do steg in this very toy setting where we’ve significantly nudged the model, it might do steg in more realistic scenarios.
There is also a claim here of the form “language models can decode things from their own encodings that other language models cannot, due to having access to privileged internal information”.
I didn’t make these points especially clear in the slide deck—thanks for the feedback!
Agreed on the rest of the points!
Can frontier language models engage in collusion for steganography? Here is a write-up of a preliminary result along these lines, showing that Deepseek-v3 may be able to collude with other instances of itself to do steganography. And also that this steganography might be more subtle than we think.
Epistemic status: highly uncertain (and I’m sorry if this ends up being overclaiming, but I’m very excited at the moment).
“Emergent obfuscation”: A threat model for verifying the CoT in complex reasoning.
It seems likely we’ll eventually deploy AI to solve really complex problems. It will likely be extremely costly (or impossible) to directly check outcomes, since we don’t know the right answers ourselves. Therefore we’ll rely instead on process supervision, e.g. checking that each step in the CoT is correct.
Problem: Even if each step in the CoT trace is individually verifiable, if there are too many steps, or the verification cost per step is too high, then it may be impractical to fully verify the CoT trace.
Exhibit A: Shinichi Mochizuki’s purported proof of the abc conjecture. Despite consuming tens of thousands of collective mathematician-hours to date, we’ve not managed to conclusively establish whether it is correct. (Most leading mathematicians now think it has serious errors, but have been unable to demonstrate this definitively.)
Exhibit B: Neural networks. Each neuron’s computation is verifiable. But it is very difficult to make meaningful / important verifiable statements about the collective behaviour of a neural network.
If we can’t fully verify the CoT, we’ll need to optimally trade off verification budget against the risk of the model reaching an undetected incorrect conclusion due to unverified faulty reasoning.
If we assume ‘trusted’ models whose reasoning we don’t need to verify, we might be able to adapt protocols from AI control for Pareto-optimal verification.
[Footnote: Here I haven’t mentioned scalable oversight protocols (like debate); while they offer a stable training objective for improving model capabilities, they are very expensive, and I don’t see a good way to convert them to cost-efficient verification protocols].
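As a toy illustration of the budget/risk trade-off above (not a proposed protocol; the steps, costs, and risk numbers are all made up): greedily verify the steps with the highest estimated risk per unit verification cost until the budget runs out.

```python
# Toy sketch of trading off a verification budget against risk from unverified
# CoT steps. The steps, costs, and risk estimates are made-up placeholders.
from dataclasses import dataclass

@dataclass
class Step:
    text: str
    cost: float  # estimated cost to verify this step
    risk: float  # estimated probability the step hides a serious error

def select_steps_to_verify(steps: list[Step], budget: float) -> list[Step]:
    """Greedy: verify steps with the highest risk per unit cost, within budget."""
    chosen, remaining = [], budget
    for step in sorted(steps, key=lambda s: s.risk / s.cost, reverse=True):
        if step.cost <= remaining:
            chosen.append(step)
            remaining -= step.cost
    return chosen

steps = [
    Step("routine algebraic simplification", cost=1.0, risk=0.01),
    Step("novel lemma with a long proof", cost=10.0, risk=0.30),
    Step("appeal to an external result", cost=3.0, risk=0.10),
]
for step in select_steps_to_verify(steps, budget=5.0):
    print("verify:", step.text)
```

The hard part, of course, is getting calibrated per-step risk estimates; this is where trusted monitors from AI control protocols might slot in.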
I get the sense that I don’t read actively very much. By which I mean: I have a collection of papers that have seemed interesting based on abstract / title, but which I haven’t spent the time to engage with further.
For the next 2 weeks, I will experiment with writing a note every day about a paper I find interesting.
The tasks I delegate to AI are very different from what I thought they’d be.
When I first started using AI for writing, I thought I’d brainstorm outlines of thoughts then use AI to edit into a finished essay.
However I find myself often doing the reverse: Using AI as a thinking assistant to get a broad outline and write a rough draft, then doing final editing myself.
I think this is consistent with how people delegate to other people.
Senior authors on research papers will often let junior authors run experiments and write rough drafts of papers
But they will “come in at the end” to write the finished essay, making sure phrasings, framings, etc. are correct.
I suspect this comes down to matters of subtle taste.
People like their writing “just so”
This preference is easier to implement directly than to communicate to others.
C.f. craftsmanship is highly personal.
I.e. there seems to be a “last mile problem” in using AI for writing, where the things AI produces are never personalized enough for you to feel it’s authentic. This last mile problem seems hard to solve.