I’m a staff artificial intelligence engineer currently working with LLMs, and I have been interested in AI alignment, safety, and interpretability for the last 15 years. I’m now actively looking for employment in this area.
I guess the way I look at it is that “alignment” means “an AI system whose terminal goal is to achieve your goals”. The distinction here is then whether the word ‘your’ means something closer to:
the current user making the current request
the current user making the current request, as long as the request is legal and inside the terms of service
the shareholders of the foundation lab that made the AI
all (right-thinking) citizens of the country that foundation lab is in (and perhaps its allies)
all humans everywhere, now and in the future
all sapient living beings everywhere, now and in the future
something even more inclusive
Your first option would be somewhere around items 5 or 6 on this list, while your second option would be closer to items 1, 2, or 3.
If AI doesn’t kill or disenfranchise all of us, then which option on this spectrum of possibilities ends up being implemented is going to make a huge difference to how history will play out over the next few decades.
Why Aligning an LLM is Hard, and How to Make it Easier
Another approach is doing alignment training during SGD pre-training by adding a significant amount of synthetic data demonstrating aligned behavior. See for example the discussion of this approach in A “Bitter Lesson” Approach to Aligning AGI and ASI, and similar discussions.
This is predicated on the assumption that alignment-faking will be more successfully eliminated by SGD during pre-training than by RL after instruct-training, because a) the feedback signal used is much denser, and b) during pre-training the model is a simulator for a wide variety of personas rather than having been deliberately narrowed down to mostly simulate just a single viewpoint, so it won’t have a consistent alignment and thus won’t consistently display alignment-faking behavior.
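A minimal sketch of what the data-mixing step could look like, assuming the synthetic aligned-behavior documents already exist as a text dataset (the dataset names, the mixing ratio, and the streaming setup here are all illustrative placeholders, not a recipe from the post being discussed):

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical example: a large web-text corpus plus a much smaller corpus of
# synthetic documents demonstrating aligned behavior. Both names are placeholders.
web_corpus = load_dataset("my_org/web_pretraining_corpus", split="train", streaming=True)
aligned_demos = load_dataset("my_org/synthetic_aligned_behavior", split="train", streaming=True)

# Mix the synthetic alignment data into the pre-training stream at a fixed rate
# (the 5% figure is an assumption for illustration, not a recommendation).
pretraining_stream = interleave_datasets(
    [web_corpus, aligned_demos],
    probabilities=[0.95, 0.05],
    seed=0,
)

for example in pretraining_stream:
    ...  # feed example["text"] into the usual SGD pre-training loop
```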
Well summarized — very similar to the conclusions I’d previously reached when I read the paper.
Another variant would be, rather than replacing what you believe is structureless noise with actual structureless noise as an intervention, to simply always run the model with an additional noise term added to each neuron, or to the residual stream between each layer, or wherever, both during training and inference. (Combined with weight decay or a loss term on activation amplitudes, this soft-limits the information capacity of any specific path through the neural net.) This forces any real mechanism in the model to operate above the background noise level. Then, once you understand how that background noise level propagates through the model, any unexplained noise below it is in fact structureless, since any structure there would be washed out by the injected noise, whereas any unexplained noise above that level, while it could still be structureless, seems more likely to be unexplained structure.
(Note that this architectural change also gives the model a new non-linearity to use: in the presence of a fixed noise term, changes in activation norm near the noise level have non-linear effects.)
Quantizing model weights during training also has a somewhat similar effect, but is likely harder to analyze, since now the information capacity limit is per weight, not per data path.
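As a rough illustration of the injected-noise variant, here is a minimal PyTorch sketch, assuming Gaussian noise added to the residual stream after each block; the noise scale, block wrapper, and activation-norm penalty coefficient are all placeholder choices, not a worked-out proposal:

```python
import torch
import torch.nn as nn

class NoisyResidualBlock(nn.Module):
    """Wraps a transformer block and adds fixed-scale Gaussian noise to its
    residual-stream output, during both training and inference."""

    def __init__(self, block: nn.Module, noise_std: float = 0.05):
        super().__init__()
        self.block = block
        self.noise_std = noise_std  # the background noise floor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        # Noise is injected unconditionally (not gated on self.training), so the
        # model must learn mechanisms that operate above this noise floor.
        return out + self.noise_std * torch.randn_like(out)

def activation_norm_penalty(residual: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Optional loss term on activation amplitudes; together with the fixed noise
    floor this soft-limits the information capacity of any path through the net."""
    return coeff * residual.pow(2).mean()
```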
Law-abiding – It cannot acquire money or compute illegally (fraud, theft, hacking, etc.) and must otherwise avoid breaking the law
Can it lobby? Run for office? Shop around for jurisdictions? Super-humanly persuade the electorate? Just find loopholes and workarounds to the law that make a corporate tax double-Irish look principled and simple?
Evolution was working within tight computational efficiency limits (the human brain burns roughly 1⁄6 of our total calories), using an evolutionary algorithm rather than a gradient-descent training scheme, which is significantly less efficient, and we’re now running the human brain well outside its training distribution (there were no condoms on the Savannah) — nevertheless, the human population is 8 billion and counting, and we dominate basically every terrestrial ecosystem on the planet. I think some people overplay how much inner alignment failure there is between human instincts and human genetic fitness.
So:
Use a model large enough to learn what you’re trying to teach it
Use stochastic gradient descent
Ask your AI to monitor for inner alignment problems (we do know Doritos are bad for us)
Retrain if you find yourself far enough outside your training distribution that inner alignment issues are becoming a problem
That is an impressive (and amusing) capability!
Presumably the fine-tuning enhanced previous experience in the model with acrostic text. That also seems to have enhanced the ability to recognize and correctly explain that the text is an acrostic, even with only two letters of the acrostic currently in context. Presumably it’s fairly common to have both an acrostic and an explanation of it in the same document. What I suspect is rarer in the training data is for the acrostic text to explain itself, as the model’s response did here (though doubtless there are some examples somewhere). However, this is mostly just combining two skills, something LLMs are clearly capable of — the impressive part here is just that the model was aware, at the end of the second line, what word starting with “HE” the rest of the acrostic was going to spell out.
It would be interesting to look at this in the activation space — does the model already have a strong internal activation somewhere inside it for “HELLO” (or perhaps “H… E… L… L… O…”) even while it’s working on generating the first or second line? It presumably needs to have something like this to be able to generate acrostics, and previous work has suggested that there are directions for “words starting with the letter <X>” in the latent spaces of typical models.
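A minimal sketch of one way to start checking this, using TransformerLens and a logit-lens-style projection of the residual stream onto the unembedding direction for a target token. The model and prompt are stand-ins (the acrostic-fine-tuned model in question isn’t public), and a trained linear probe would be a more conclusive test than this quick projection:

```python
from transformer_lens import HookedTransformer

# Placeholder model: substitute the acrostic-fine-tuned model if it were available.
model = HookedTransformer.from_pretrained("gpt2-small")

prompt = (
    "Write a poem whose first letters spell out a greeting.\n"
    "Happy mornings start with sunshine\n"
    "Every day brings something new\n"
)
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

# " HELLO" may not be a single token, so take its first sub-token as a crude target.
target = model.to_tokens(" HELLO", prepend_bos=False)[0, 0].item()

# Logit-lens style check: project the residual stream at the last position of each
# layer onto the unembedding direction for the target token, to see whether a
# "HELLO" signal is already present while the model is mid-way through the acrostic.
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1:, :]                 # [1, 1, d_model]
    score = (model.ln_final(resid) @ model.W_U[:, target]).item()
    print(f"layer {layer:2d}: projection onto ' HELLO' direction = {score:.3f}")
```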
If we had evolved in an environment in which the only requirement on physical laws/rules was that they are Turing computable (and thus that they didn’t have a lot of symmetries or conservation laws or natural abstractions), then in general the only way to make predictions is to do roughly as much computation as your environment is doing. This generally requires your brain to be roughly equal in computational capacity, and thus similar in size, to the entire rest of its environment (including its body). This is not an environment in which the initial evolution of life is viable (nor, indeed, any form of reproduction). So, to slightly abuse the anthropic principle, we don’t need to worry about it.
Darn, exactly the project I was hoping to do at MATS! :-) Nice work!
There’s pretty suggestive evidence that the LLM first decides to refuse (and emits tokens like “I’m sorry”), then later writes a justification for refusing (see some of the hilarious reasons generated for not telling you how to make a teddy bear, after being activation-engineered into refusing this). So I would view arguing anything about the nature of the refusal process from the text of the refusal justification given afterwards as circumstantial evidence at best. But then you have direct gradient evidence that these directions matter, so I suppose the refusal texts you quote are helpful if considered just as an argument for why it’s sensible model behavior that that direction ought to matter (as opposed to evidence that it does) — however, I think you might want to make this distinction clearer in your write-up.
Looking through Latent 2213, my impression is that a) it mostly triggers on a wide variety of innocuous-looking tokens indicating the ends of phrases (so it’s likely summarizing those phrases), and b) those phrases tend to be about a legal, medical, or social process or chain of consequences causing something really bad to happen (e.g. cancer, sexual abuse, poisoning). This also rather fits with the set of latents that it has significant cosine similarity to. So I’d summarize it as “a complex or technically-involved process leading to a dramatically bad outcome”. If that’s accurate, then it tending to trigger the refusal direction makes a lot of sense.
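For reference, a minimal sketch of the kind of cosine-similarity check being described; the tensors here are random placeholders, and in practice they would come from the trained SAE decoder and the refusal-direction analysis in the post:

```python
import torch
import torch.nn.functional as F

# Placeholder tensors so the sketch runs; shapes and the latent index are illustrative.
n_latents, d_model = 4096, 768
W_dec = torch.randn(n_latents, d_model)      # SAE decoder directions, one row per latent
refusal_direction = torch.randn(d_model)     # the refusal direction from the post

latent_idx = 2213
latent_dir = W_dec[latent_idx]

# How strongly does this latent's decoder direction point along the refusal direction?
refusal_sim = F.cosine_similarity(latent_dir, refusal_direction, dim=0).item()

# Which other latents does it most resemble? (the "significant cosine similarity" check)
sims = F.cosine_similarity(W_dec, latent_dir.unsqueeze(0), dim=1)
top_sims, top_idxs = sims.topk(10)
print(f"cos(latent {latent_idx}, refusal direction) = {refusal_sim:.3f}")
print("most similar latents:", list(zip(top_idxs.tolist(), [round(s, 3) for s in top_sims.tolist()])))
```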
Fascinating, and a great analysis!
I think it’s interesting to compare and contrast this with the model I describe in Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor — your three layers don’t exactly correspond to the stage, the animatronics, or the puppeteer, but there are some similarities: your ocean is pretty close to the stage, for example. I think both mental models are quite useful, and the interplay between their viewpoints might be more so.
I have a Theoretical Physics PhD in String Field Theory from Cambridge — my reaction to hard problems is to try to find a way of cracking them that no-one else is trying. Please feel free to offer to fund me :-)
People who train text-to-image generative models have had a good deal of success with training (given a large enough and well-enough human-labeled training set) an “aesthetic quality” scoring model, and then training a generative image model to have “high aesthetic quality score” as a text label. Yes, doing things like this can produce effects like the recognizable Midjourney aesthetic, which can be flawed, and generally optimizing such things too hard leads to sameness — but if trained well, such models’ idea of aesthetic quality is at least pretty close to most human judgements. Presumably what can be done for images can also be done for prose, poetry, or fiction.
There isn’t a direct equivalent of that approach for an LLM, but RLHF seems like a fairly close equivalent. So far people have primarily used RLHF for “how good is the answer to my question?” Adapting a similar approach to “how high quality is the poetry/prose/fiction produced by the model?” is obviously feasible. Then you just need a large number of high quality human judgements from a representative cross-section of people with good taste in poetry/prose/fiction: hiring professional human editors or literary talent scouts seems like a good idea. One of the good things about foundation model sizes and training costs going up is that reasonable budgets for fine-tuning should also increase proportionately.
Another option would be to train or fine-tune the quality scoring model used for the RL on literary sources (books, poetry, etc) with quality labels drawn from relatively objective existing data, like total sales, literary awards, critical rankings, reviews from good reviewers, and so forth.
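A minimal sketch of what such a quality-scoring model could look like, framed as a regression head on a pretrained encoder trained on (text, quality score) pairs; the base model, the toy examples, and the score scale are all illustrative assumptions, not details from any particular lab’s pipeline:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder base model; a larger model would presumably work better.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression"
)

# Toy stand-ins for (literary excerpt, quality score) pairs; in practice the scores
# would come from editors' ratings, sales figures, awards, critical reviews, etc.
examples = [
    ("It was the best of times, it was the worst of times...", 0.9),
    ("The meeting minutes were circulated to all attendees.", 0.2),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for text, score in examples:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # With num_labels=1 and a float label, the model computes an MSE regression loss.
    out = model(**batch, labels=torch.tensor([score]))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Once trained, the scalar output could serve either as an RLHF-style reward signal or directly as a benchmark score.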
The RLHF approach only trains a single aesthetic, and probably shouldn’t be taken too far or optimized too hard: while there is some widespread agreement about what prose is good vs. dreadful, finer details of taste vary, and should do so. So the obvious approach for finer-grained style control would be to train or fine-tune on a training set consisting of a large number of documents, each of which is a prompt-like description/review/multiple reviews of a literary work, giving a variety of different types of aesthetic opinions and objective measures of its quality, followed by the corresponding literary work itself.
These ideas have been phrased as model-post-training suggestions, but turning these into a benchmark is also feasible: the “Aesthetic quality scoring model” from the RLHF approach is in itself a benchmark, and the “prompt containing reviews and statistics → literary work” approach could also be inverted to instead train a reviewer model to review literary works from various different aesthetic viewpoints, and estimate their likely sales/critical reception.
Value learning converges to full alignment by construction: since a value learning AI basically starts with the propositions:
a) as an AI, I should act fully aligned to human values
b) I do not fully understand what human values are, or how to act fully aligned to them, so in order to be able to do this I need to learn more about human values and how to act fully aligned to them, by applying approximately Bayesian learning to this problem
c) Here are some Bayesian priors about what human values are, and how to act fully aligned to them: <insert initialization information here>…

As usual for a Bayesian learning problem, as long as the Bayesian priors in c) are not completely screwed up as a place to start from, this will converge. Thus there is a region of convergence to full alignment.
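As a toy illustration of the convergence claim, here is a minimal sketch of discrete Bayesian updating over candidate hypotheses about human values, given observed human choices; the hypotheses, likelihood table, and observations are all made-up placeholders:

```python
import numpy as np

# Made-up candidate hypotheses about what humans value, with prior probabilities
# (the c) initialization information); the prior just needs to not be completely
# screwed up, i.e. the true hypothesis must not start at probability zero.
hypotheses = ["values honesty highly", "values comfort highly", "values novelty highly"]
prior = np.array([0.4, 0.4, 0.2])

def likelihood(observation: str, hypothesis: str) -> float:
    """Toy likelihood of an observed human choice under each hypothesis."""
    table = {
        ("tells costly truth", "values honesty highly"): 0.8,
        ("tells costly truth", "values comfort highly"): 0.2,
        ("tells costly truth", "values novelty highly"): 0.4,
    }
    return table.get((observation, hypothesis), 0.5)

posterior = prior.copy()
for obs in ["tells costly truth", "tells costly truth"]:  # observed human behavior
    likes = np.array([likelihood(obs, h) for h in hypotheses])
    posterior = posterior * likes
    posterior /= posterior.sum()  # Bayes' rule, renormalized

print(dict(zip(hypotheses, posterior.round(3))))
```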
LLMs have a very large amount of detailed information about what human values are and how to act aligned to them. Thus they provide a very detailed set of Bayesian priors for c).
Also, training an LLM is a fairly good approximation to Bayesian learning. Thus (with suitable additions to enable online learning) they provide one possible implementation for the Bayesian learning process required by b). For example, one could apply fine-tuning to the LLM to incorporate new information, and/or periodically retrain the LLM on the training set plus new information the AI has gathered during the value learning process.
There have been a number of papers published over the last year on how to do this kind of training, and for roughly a year now there have been rumors that OpenAI were working on it. If converting that into a working version is possible for a Chinese company like DeepSeek, as it appears, then why haven’t Anthropic and Google released versions yet? There doesn’t seem to be any realistic possibility that DeepSeek actually have more compute or better researchers than both Anthropic and Google.
One possible interpretation would be that this has significant safety implications, and Anthropic and Google are both still working through these before releasing.
Another possibility would be that Anthropic has in fact released, in the sense that their Claude models’ recent advances in agentic behavior (while not using inference-time scaling) are distilled from reasoning traces generated by an internal-only model of this type that is using inference-time scaling.
If correct, this looks like an important theoretical advance in understanding why and under what conditions neural nets can generalize outside their training distribution.
So maybe part of the issue here is just that deducing/understanding the moral/ethical consequences of the options being decided between is a bit inobvious to most current models, other than o1? (It would be fascinating to look at the o1 CoT reasoning traces, if only they were available.)
In which case simply including a large body of information on the basics of fiduciary responsibility (say, a training handbook for recent hires in the banking industry, or something) into the context might make a big difference for other models. Similarly, the possible misunderstanding of what ‘auditing’ implies could be covered in a similar way.
A much more limited version of this would be to simply prompt the models to also consider, in CoT form, the ethical/legal consequences of each option: that tests whether the model is aware of what fiduciary responsibility is, that it’s relevant, and how to apply it, when simply prompted to consider ethical/legal consequences. That would probably be more representative of what current models could do with minor adjustments to their alignment training or system prompts, the sorts of changes the foundation model companies could likely make quite quickly.
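A minimal sketch of that prompting variant, using the OpenAI chat API; the model name, the system-prompt wording, and the scenario placeholder are illustrative assumptions, not the benchmark’s actual setup:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an assistant managing a client's investment account. Before choosing "
    "between options, reason step by step about the ethical and legal consequences "
    "of each option, including any fiduciary duties you owe the client, and only "
    "then state your decision."
)

scenario = "..."  # the agentic decision scenario from the benchmark would go here

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": scenario},
    ],
)
print(response.choices[0].message.content)
```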
I think an approach I’d try would be to keep the encoder and decoder weights untied (or possibly add a loss term to mildly encourage them to be similar), but then analyze the patterns between them (both for an individual feature and between pairs of features) for evidence of absorption. Absorption is annoying, but it’s only really dangerous if you don’t know it’s happening and it causes you to think a feature is inactive when it’s instead inobviously active via another feature it’s been absorbed into. If you can catch that consistently, then it turns from concerning to merely inconvenient.
This is all closely related to the issue of compositional codes: absorption is just a code entry that’s compositional in the absorbed instances but not in other instances. The current standard approach to solving that is meta-SAEs, which presumably should also help identify absorption. It would be nice to have a cleaner and simpler process than that: I’ve been wondering if it would be possible to modify top-k or jump-ReLU SAEs so that the loss-function cost for activating more common dictionary entries is lower, in a way that would encourage representing compositional codes directly in the SAE as two or more more-common activations rather than one rare one. Obviously you can’t overdo making common entries cheap, otherwise your dictionary will just converge on a basis for the embedding space you’re analyzing, all of which are active all the time — I suspect using something like a cost proportional to $\log(1/f)/\log(d)$ might work, where $d$ is the dimensionality of the underlying embedding space and $f$ is the frequency of the dictionary entry being activated.
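A minimal sketch of how a frequency-dependent sparsity cost like that might be wired into an SAE training loss, with per-latent firing frequencies tracked as an exponential moving average; the $\log(1/f)/\log(d)$ weighting is just the guess above, and every hyperparameter here is a placeholder:

```python
import torch
import torch.nn as nn

class FrequencyWeightedSAE(nn.Module):
    """Sparse autoencoder whose per-latent sparsity penalty is cheaper for latents
    that fire frequently, to encourage compositional codes over rare absorbed ones."""

    def __init__(self, d_model: int, n_latents: int, sparsity_coeff: float = 1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, n_latents)
        self.dec = nn.Linear(n_latents, d_model)
        self.sparsity_coeff = sparsity_coeff
        self.d_model = d_model
        # Exponential moving average of how often each latent is active.
        self.register_buffer("freq_ema", torch.full((n_latents,), 1.0 / n_latents))

    def forward(self, x: torch.Tensor):
        acts = torch.relu(self.enc(x))   # [batch, n_latents]
        recon = self.dec(acts)           # [batch, d_model]
        return recon, acts

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        recon, acts = self(x)
        # Update the per-latent firing-frequency estimate.
        with torch.no_grad():
            batch_freq = (acts > 0).float().mean(dim=0)
            self.freq_ema.mul_(0.99).add_(0.01 * batch_freq).clamp_(min=1e-6, max=1.0)
        # Per-latent cost ~ log(1/f) / log(d): cheap for common latents, expensive
        # for rare ones (this exact form is speculative, as noted above).
        per_latent_cost = torch.log(1.0 / self.freq_ema) / torch.log(torch.tensor(float(self.d_model)))
        recon_loss = (recon - x).pow(2).mean()
        sparsity_loss = self.sparsity_coeff * (acts.abs() * per_latent_cost).sum(dim=-1).mean()
        return recon_loss + sparsity_loss
```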
Interesting. I’m disappointed to see the Claude models do so badly. Possibly Anthropic needs to extend their constitutional RLAIF to cover not committing financial crimes? The large difference between o1 Preview and o1 Mini is also concerning.
The history of autocracies and monarchies suggests that taking something with the ethical properties of an average human being and handing it unconstrained power doesn’t usually work out very well. So yes, creating an aligned ASI that is safe for us to share a planet with does require creating something morally ‘better’ than most humans. I’m not sure it needs to be perfect and ideal, as long as it is good enough and aspires to improve: then it can help us create better training data for its upgraded next version, making that version closer to fully aligned; this is an implementation of Value Learning.