RogerDearnaley

Karma: 1,568

I’m an staff artificial intelligence engineer currently working with LLMs, and have been interested in AI alignment, safety and interpretability for the last 15 years. I’m now actively looking for employment working in this area.

RogerDearnaley Apr 14, 2025, 8:49 AM
LW: 2 AF: 1
0
AF
on: Towards a scale-free theory of intelligent agency
I’m not sure that I see a reason to expect strict scale-freedom to be an accurate assumption. Consider two different common classes of agents:
1) evolved agents (such as humans)
2) engineered/trained artificial agents (such as LLM-powered agents)
For evolved agents, the evolutionary incentives that they evolved under apply differently in the case of subsystems withing a single living individual, versus multiple living individuals within a community that shows collective agentic behavior. Individual subsystems within a single individual succeed or fail (live and pass on their genes, or die) together, so evolutionarily they have a strong incentive to cooperate to form a single effective agent. Whereas separate living individuals withing a community have (evolutionarily speaking) separate (though perhaps correlated) success/failure criteria, so have evolutionary pressures on them both to compete and also (in non-zero-sum situations) to cooperate. So I would not expect evolved agents to be scale-free — the single-individual scale at which evolution applies seems almost inevitable to be privileged.
For agents consisting of communities of cooperating humans, I would expect communities smaller than Dunbar’s Number (ones small enough to be able to operate without hierarchies or bureaucracy, because all members know each other well) to operate differently than larger communities, so again I would not expect them to be entirely scale-free.
For engineered agents, the situation is far less clear-cut, and depends upon the cognitive limits and engineering techniques of whoever is engineering the agents. For current LLM-based artificial agents, they are normally pretrained and fine-tuned alone, and (where applicable) RL-trained against a policy model. So for this current approach, the single-agent scale used during training also seems almost inevitable to be privileged.
Also, pretraining distills human intelligence into the AI via the training corpora, so if the single-human scale is privileged for evolutionary reasons, this phenomenon seems likely to be transferred to LLMs during their pretraining phase.
However, I would agree that if a community is operating as a (fairly-effective) agent, then it’s cooperating components must be mostly cooperating, more than they are competing, so I agree that there might well be some semi-scale-free aspects to its behavior. However, there might well also be some scale-dependent constraints on or inefficiencies of it’s behavior, because it’s components were prone to not always cooperating as well as subsystems of an individual would.

RogerDearnaley Jan 29, 2025, 6:11 AM
1 point
1
in reply to: Andrew CC’s comment on: Why Aligning an LLM is Hard, and How to Make it Easier
The history of autocracies and monarchies suggests that taking something with the ethical properties of an average human being and handing it unconstrained power doesn’t usually work out very well. So yes, to create an aligned ASI that is safe for us to share a planet with does require creating something morally ‘better’ than most humans. I’m not sure it needs to be perfect and ideal, as long as it is good enough and aspires to improve: they it can help us create better training data for its upgraded next version that will make that be closer to fully aligned; this is an implementation of Value Learning.

RogerDearnaley Jan 23, 2025, 7:02 AM
3 points
1
on: are there 2 types of alignment?
I guess the way I look at it is that “alignment” means “an AI system whose terminal goal is to achieve your goals”. The distinction here is then whether the word ‘your’ means something closer to:
1. the current user making the current request
2. the current user making the current request, as long as the request is legal and inside the terms of service
3. the shareholders of the foundation lab that made the AI
4. all (righthinking) citizens of the country that foundation lab is in (and perhaps its allies)
5. all humans everywhere, now and in the future
6. all sapient living beings everywhere, now and in the future
7. something even more inclusive
Your first option would be somewhere around item 5. or 6. on this list, while your second option would be closer to items 1., 2. or 3.
If AI doesn’t kill or disenfranchise all of us, then which option on this spectrum of possibilities ends up being implemented is going to make a huge difference to how history will play out over the next few decades.

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaleyJan 23, 2025, 6:44 AM

30 points

3 comments4 min readLW link

RogerDearnaley Jan 21, 2025, 12:16 AM
4 points
2
in reply to: ryan_greenblatt’s comment on: Alignment Faking in Large Language Models
Another approach is doing alignment training during SGD pre-training by adding a significant amount of synthetic data demonstrating aligned behavior. See for example the discussion of this approach in A “Bitter Lesson” Approach to Aligning AGI and ASI, and similar discussions.
This is predicated on the assumption that alignment-faking will by more successfully eliminated by SGD during pretraining than RL after instruct-training, because a) the feedback signal used is much denser, and b) during pre-training the model is a simulator for a wide variety of personas rather than having been deliberately narrowed down to mostly simulate just a single viewpoint, so it won’t have a consistent alignment and thus won’t consistently display alignment-faking behavior.

RogerDearnaley Jan 20, 2025, 11:55 PM
4 points
0
on: Worries about latent reasoning in LLMs
Well summarized — very similar to the conclusions I’d previously reached when I read the paper.

RogerDearnaley Jan 20, 2025, 11:22 PM
2 points
0
in reply to: RogerDearnaley’s comment on: Compact Proofs of Model Performance via Mechanistic Interpretability
Another variant would be, rather than replacing what you believe is structureless noise with actual structureless noise as an intervention, to simply always run the model with an additional noise term added to each neuron, or to the residual stream between each layer, or whatever, both during training and inference. (combined with a weight decay or a loss term on activation amplitudes, this soft-limits the information capacity of any specific path through the neural net). This then forces any real mechanisms in the model to operate above this background noise level: so then, once you understand how the background noise level is propagated through the model, it becomes clear that any unexplained noise below that level is in fact structureless, since any structure will be washed out by the injected noise, whereas any unexplained noise level above that, while it could still be structureless, seems more likely to be unexplained structure.
(Note that this architectural change also gives the model a new non-linearity to use: in the presence of a fixed noise term, changes in activation norm near the noise level have non-linear effects.)

Quantizing model weights during training also has a somewhat similar effect, but is likely harder to analyze, since now the information capacity limit is per weight, not per data path.

RogerDearnaley Jan 18, 2025, 6:35 AM
4 points
2
on: Model Amnesty Project
Law-abiding – It cannot acquire money or compute illegally (fraud, theft, hacking, etc.) and must otherwise avoid breaking the law
Can it lobby? Run for office? Shop around for jurisdictions? Super-humanly persuade the electorate? Just find loopholes and workarounds to the law that make a corporate tax double-Irish look principled and simple?

RogerDearnaley Jan 18, 2025, 6:29 AM
9 points
6
on: What are the plans for solving the inner alignment problem?
Evolution was working within tight computational efficiency limits (the human brain burns roughly ¹⁄₆ of our total calories), using a evolutionary algorithm rather than gradient descent training scheme which is significantly less efficient, and we’re now running the human brain well outside it’s training distribution (there were no condoms on the Savannah) — nevertheless, the human population is 8 billion and counting, and we dominate basically every terrestrial ecosystem on the planet. I think some people overplay how much inner alignment failure there is between human instincts and human genetic fitness.
So:
1. Use a model large enough to learn what you’re trying to teach it
2. Use stochastic gradient descent
3. Ask your AI to monitor for inner alignment problems (we do know Doritos are bad for us)
4. Retrain if you find yourself far enough outside your training distribution that inner alignment issues are becoming a problem

RogerDearnaley Jan 17, 2025, 11:27 AM
5 points
0
on: A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
That is an impressive (and amusing) capability!
Presumably the fine-tuning enhanced previous experience in the model with acrostic text. That also seems to have enhanced the ability to recognize and correctly explain that the text is an acrostic, even with only two letters of the acrostic currently in context. Presumably it’s fairly common to have both an acrostic and an explanation of it in the same document. What I suspect is rarer in the training data is for the acrostic text to explain itself, as the model’s response did here (though doubtless there are some examples somewhere). However, this is mostly just combining two skills, something LLMs are clearly capable of — the impressive part here is just that the model was aware, at the end of the second line, what word starting with “HE” the rest of the acrostic was going to spell out.
It would be interesting to look at this in the activation space — does the model already have a strong internal activation somewhere inside it for “HELLO” (or perhaps (“H… E… L… L… O…”) even while it’s working on generating the first or second line? It presumably needs to have something like this to be able generate acrostics, and previous work has suggested that there are directions for “words starting with the letter <X>” in the latent spaces of typical models.

RogerDearnaley Jan 16, 2025, 11:01 AM
3 points
0
in reply to: Noosphere89’s comment on: What Is The Alignment Problem?
If we had evolved in an environment in which the only requirement on physical laws/rules was that they are Turing computable (and thus that they didn’t have a lot of symmetries or conservation laws or natural abstractions), then in general the only way to make predictions is to do roughly as much computation as your environment is doing. This generally requires your brain to be roughly equal in computational capacity, and thus similar in size, to the entire rest of its environment (including its body). This is not an environment in which the initial evolution of life is viable (nor, indeed, any form of reproduction). So, to slightly abuse the anthropic principle, we don’t need to worry about it.

RogerDearnaley Jan 14, 2025, 9:47 AM
9 points
0
on: Finding Features Causally Upstream of Refusal
Darn, exactly the project I was hoping to do at MATS! :-) Nice work!
There’s pretty suggestive evidence that the LLM first decides to refuse (and emits token’s like “I’m sorry”), then later writes a justification for refusing (see some of the hilarious reasons generated for not telling you how to make a teddy bear, after being activation engineered into refusing this). So I would view arguing anything about the nature of the refusal process from the text of the refusal-justification given afterwards as circumstantial evidence at best. But then you have direct gradient evidence that these directions matter, so I suppose the refusal texts you quote, if considered just as an argument as to why it’s sensible model behavior that that direction ought to matter (as opposed to evidence that it does), are helpful — however, I think you might want to make this distinction clearer in your write-up.

Looking through Latent 2213, my impression is that a) it mostly triggers on a wide variety of innocuous-looking tokens indicating the ends of phrases (so likely it’s summarizing those phrases), and b) those phrases tend to be about a legal, medical, or social process or chain of consequences causing something really bad to happen (e.g. cancer, sexual abuse, poisoning). This also rather fits with the set of latents that it has significant cosine similarity to. So I’d summarize it as “a complex or technically-involved process leading to a dramatically bad outcome”.
If that’s accurate, then it tending to trigger the refusal direction makes a lot of sense.

RogerDearnaley Jan 14, 2025, 9:23 AM
2 points
0
on: A Three-Layer Model of LLM Psychology
Fascinating, and a great analysis!
I think it’s interesting to compare and contrast this with the model I describe in Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor — your three layers don’t exactly correspond to the stage, the animatronics, or the puppetter, but there are some similarities: your ocean is pretty close to the the stage, for example. I think both mental models are quite useful, and the interplay between the viewpoints of them might be more so.

RogerDearnaley Jan 14, 2025, 8:17 AM
5 points
0
on: The Field of AI Alignment: A Postmortem, and What To Do About It
I have a Theoretical Physics PhD in String Field Theory from Cambridge — my reaction to hard problems is to try to find a way of cracking them that no-one else is trying. Please feel free to offer to fund me :-)

RogerDearnaley Dec 18, 2024, 8:38 AM
3 points
−1
on: Ideas for benchmarking LLM creativity
People who train text-to-image generative models have had a good deal of success with training (given a large enough and well-enough human-labeled training set) an “aesthetic quality” scoring model, and then training a generative image model to have “high aesthetic quality score” as a text label. Yes, doing things like this can produces effects like the recognizable Midjourney aesthetic, which can be flawed, and generally optimizing such things too hard leads to sameness — but if trained well such models’ idea of aesthetic quality is at least pretty close to most human judgements. Presumably what can be done for images can also be done for prose, poetry, or fiction as well.
There isn’t a direct equivalent of that approach for an LLM, but RLHF seems like a fairly close equivalent. So far people have primarily used RLHF for “how good is the answer to my question?” Adapting a similar approach to “how high quality is the poetry/prose/fiction produced by the model?” is obviously feasible. Then you just need a large number of high quality human judgements from a representative cross-section of people with good taste in poetry/prose/fiction: hiring professional human editors or literary talent scouts seems like a good idea. One of the good things about foundation model sizes and training costs going up is that reasonable budgets for fine-tuning should also increase proportionately.
Another option would be to train or fine-tune the quality scoring model used for the RL on literary sources (books, poetry, etc) with quality labels drawn from relatively objective existing data, like total sales, literary awards, critical rankings, reviews from good reviewers, and so forth.
The RLHF approach only trains a single aesthetic, and probably shouldn’t be taken too far or optimized too hard: while there is some widespread agreement about what prose is good vs, dreadful, finer details of taste vary, and should do so. So the obvious approach for finer-grained style control would be to train or fine-tune on a training set of a large number documents each of which consists of a prompt-like description/review/multiple reviews of a literary work, giving a variety of different types of aesthetic opinions and objective measures of its quality, followed by the corresponding literary work itself.
These ideas have been phrased as model-post-training suggestions, but turning these into a benchmark is also feasible: the “Aesthetic quality scoring model” from the RLHF approach is in itself a benchmark, and the “prompt containing reviews and statistics → literary work” approach could also be inverted to instead train a reviewer model to review literary works from various different aesthetic viewpoints, and estimate their likely sales/critical reception.

RogerDearnaley Dec 4, 2024, 5:42 AM
5 points
2
in reply to: Satron’s comment on: Requirements for a Basin of Attraction to Alignment
1. Value learning converges to full alignment by construction: since a value learning AI basically starts with the propositions:
  a) as an AI, I should act fully aligned to human values
  b) I do not fully understand what human value are, or how to act fully aligned to them, so in order to be able to do this I need to learn more about human values and how to act fully aligned to them, by applying approximately Bayesian learning to this problem
  c) Here are some Bayesian priors about what human values are, and how to act fully aligned to them: <insert initialization information here>…
2. As usual for a Bayesian learning problem, as long as the Bayesian priors 1 c) are not completely screwed up as a place to start from, this will converge. Thus there is a region of convergence to full alignment.
3. LLMs have a very large amount of detailed information about what human values are and how to act aligned to them. Thus they provide a very detailed set of Bayesian priors for 1 c).
4. Also, training an LLM is a fairly good approximation to Bayesian learning. Thus (with suitable additions to enable online learning) they provide one possible implementation for the Bayesian learning process required by 1 b). For example, one could apply fine-tuning to the LLM to incorporate new information, and/or periodically retrain the LLM based on the training set plus new information the AI has gathered during the value learning process.

RogerDearnaley Nov 24, 2024, 9:43 PM
4 points
0
in reply to: Nikola Jurkovic’s comment on: DeepSeek beats o1 on math and ties on coding; will release weights
There had been a number of papers published over the last year on how to do this kind of training, and for roughly a year now there have been rumors that OpenAI were working on it. If converting that into a working version is possible for a Chinese company like DeepSeek, as it appears, then why haven’t Anthropic and Google released versions yet? There doesn’t seem to be any realistic possibility that DeepSeek actually have more compute or better researchers than both Anthropic and Google.
One possible interpretation would be that this has significant safety implications, and Anthropic and Google are both still working through these before releasing.
Another possibility would be that Anthropic has in fact released, in the sense that their Claude models’ recent advances in agentic behavior (while not using inference-time scaling) are distilled from reasoning traces generated by an internal-only model of this type that is using inference-time scaling.

RogerDearnaley Nov 24, 2024, 9:24 PM
2 points
0
on: Disentangling Representations through Multi-task Learning
If correct, this looks like an important theoretical advance in understanding why and under what conditions neural nets can generalize outside their training distribution.

RogerDearnaley Nov 20, 2024, 9:54 PM
3 points
0
in reply to: claudia.biancotti’s comment on: Chat Bankman-Fried: an Exploration of LLM Alignment in Finance
So maybe part of the issue here is just that deducing/understanding the moral/ethical consequences of the options being decided between is a bit inobvious most current models, other than o1? (It would be fascinating to look at the o1 CoT reasoning traces, if only they were available.)
In which case simply including a large body of information on the basics of fiduciary responsibility (say, a training handbook for recent hires in the banking industry, or something) into the context might make a big difference for other models. Similarly, the possible misunderstanding of what ‘auditing’ implies could be covered in a similar way.
A much more limited version of this might be to simply prompt the models to also consider, in CoT form, the ethical/legal consequences of each option: that tests whether the model is aware of what fiduciary responsibility is, that it’s relevant, and how to apply it, if it is simply prompted to consider ethical/legal consequences. That would probably be more representative of what current models could likely do with minor adjustments to their alignment training or system prompts, the sorts of changes the foundation model companies could likely do quite quickly.

RogerDearnaley Nov 20, 2024, 9:42 PM
2 points
0
in reply to: CallumMcDougall’s comment on: Toy Models of Feature Absorption in SAEs
I think an approach I’d try would be to keep the encoder and decoder weights untied (or possibly add a loss term to mildly encourage them to be similar), but then analyze the patterns between them (both for an individual feature and between pairs of features) for evidence of absorption. Absorption is annoying, but it’s only really dangerous if you don’t know it’s happening and it causes you to think a feature is inactive when it’s instead inobviously active via another feature it’s been absorbed into. If you can catch that consistently, then it turns from concerning to merely inconvenient.
This is all closely related to the issue of compositional codes: absorption is just a code entry that’s compositional in the absorbed instances but not in other instances. The current standard approach to solving that is meta SAEs, which presumably should also help identify absorption. It would be nice to have a cleaner and simpler process than that: than that I’ve been wondering if it would be possible to modify top-k or jump-RELU SAEs so that the loss function cost for activating more common dictionary entries is lower, in a way that would encourage representing compositional codes directly in the SAE as two-or-more more common activations rather than one rare one. Obviously you can’t overdo making common entries cheap, otherwise your dictionary will just converge on a basis for the embedding space you’re analyzing, all $d$ of which are active all the time — I suspect using something like a cost proportional to $l n (m a x (d, 1 / f))$ might work, where $d$ is the dimensionality of the underlying embedding space and $f$ is the frequency of the dictionary entry being activated.

RogerDearnaley

Why Align­ing an LLM is Hard, and How to Make it Easier

Why Aligning an LLM is Hard, and How to Make it Easier