I’m a staff artificial intelligence engineer and researcher working with LLMs, and have been interested in AI alignment, safety, and interpretability for the last 17 years. I did research on this during SERI MATS in summer 2025, and am now an independent researcher at Meridian in Cambridge. I’m currently looking for either employment or funding to work on this subject in the London/Cambridge area in the UK.
RogerDearnaley
Seal of the Apocalypse: Does this work in people who swear all the time?
They tested this, and the answer is no. The BBC did a lovely segment on this, starring Brian Blessed, who apparently swears all the time, and Stephen Fry, who apparently does not.
I found it: Less Online, 5pm on Sunday Jun 1st in Bayes Ground Floor
Last summer I attended a talk at Lighthaven (in Bldg B on the first floor) by Alexander Wales on prompting LLMs to write novels. I’m trying to figure out when this was, and whether it was at LessOnline or at MATS. Does anyone still have a copy of the LessOnline ’25 schedule who could look this talk up?
I’d like to offer my thoughts on this topic as another source to explore:
• A Sense of Fairness: Deconfusing Ethics: suggests a framework for considering the issue, and why an aligned AI would decline moral standing (the first post in a sequence, some later posts are also relevant)
• Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV: basing that framework on the context of the evolutionary psychology of humans
I think you should have titled or subtitled this “How Fast is my Centaur?” :-)
Interesting, exciting, and very valuable work.
I wish you’d labeled the points to help look for model family effects.
The point spreads are quite noisy — your conclusion that quality caps out looks very sensitive to a balance between two outlier points, one very high and the other very low: remove either of those and the story might change. Obviously there is enormous economic value in ensuring that low capability humans don’t drag high capability model outputs down, whether that requires retraining the model or training the human.
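(A quick leave-one-out refit is the easiest way to check that kind of fragility — a minimal sketch, assuming your data reduces to two arrays; the variable names and the straight-line fit are just placeholders for whatever curve you actually fitted:)

```python
import numpy as np

def fitted_slope(x, y):
    """Placeholder summary statistic: least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

def leave_one_out(x, y):
    """Refit with each point dropped; a wide spread means the conclusion is fragile."""
    return np.array([
        fitted_slope(np.delete(x, i), np.delete(y, i)) for i in range(len(x))
    ])

# x = human capability scores, y = output quality scores (replace with the real data)
# slopes = leave_one_out(x, y)
# print(fitted_slope(x, y), slopes.min(), slopes.max())
```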
Is that 20% prediction a total increase in GDP over the period, or an annualized rate of increase?
The other problem is that Seth’s useful and thought-provoking map is in 2 dimensions, and humans are used to thinking in 1–3 dimensions (and have a visual cortex with 2-dimensional local connectivity, so lack the wetware for thinking in more than 2½ dimensions). LLM activations, KV values etc. are generally in O(8192) dimensions. High-dimensional spaces just have a lot of statistical/geometric properties that are wildly unintuitive to us, since we’re used to working in very low numbers of dimensions. (Also known as the curse of dimensionality.) You can, with practice, learn to recognize when your intuition is misleading you and what the correct answer is, but this takes quite a bit of practice. For example, if you pick two vectors at random in a high-dimensional space, they will almost invariably be almost orthogonal to each other (the angle between them will be nearly 90 degrees). Random walks in high-dimensional spaces practically never return to anywhere near any place they’ve gone before: they continually get “more lost”. So a lot of the intuitions that Seth’s diagram gives are rather misleading: start in the middle of the “Bay of Dimensional Mismatch”, go a shortish distance in a random direction, and you will almost certainly end up deep at sea, rather than back on land as his 2-D map suggests. Nevertheless, all the effects he discusses are real and important—just be aware that there’s simply no way to diagram them in only 2 dimensions (or anything a human can visualize) that isn’t inherently rather misleading.
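If you want to see both of these effects for yourself, here’s a minimal numpy sketch (the 8192 is just a stand-in for a typical LLM hidden dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8192  # stand-in for a typical LLM hidden dimension

# 1. Two random vectors in high dimensions are almost exactly orthogonal.
u, v = rng.standard_normal(d), rng.standard_normal(d)
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"angle between two random {d}-d vectors: {np.degrees(np.arccos(cosine)):.2f} degrees")

# 2. A high-dimensional random walk keeps getting further from its start;
#    a 2-D one often drifts back much closer.
for dim in (2, 8192):
    pos = np.zeros(dim)
    dists = []
    for _ in range(1000):
        pos += rng.standard_normal(dim)
        dists.append(np.linalg.norm(pos))
    print(f"d={dim}: distance at step 100 = {dists[99]:.0f}, "
          f"closest approach to the start after that = {min(dists[100:]):.0f}")
```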
I’d suggest TurnTrout’s writing (Alex Turner at DeepMind), since he’s the person who first came up with the idea. Most of his posts are on LessWrong/the Alignment Forum, but they’re best organized on his own website. I’d suggest starting at https://turntrout.com/research, reading the section on Shard Theory, and following links.
He himself admits that some of his key posts often seem to get misunderstood: I think they repay careful reading and some thought.
I completely agree with you — I personally have filtered a thousand samples manually, and it takes a while. Finding a good human-LLM centaur solution is very helpful. Sounds like you know all the tricks at least as well as I do.
Does corrigibility make it harder to get such an international agreement, compared to alternative strategies? I can imagine an effect: the corrigible AI would serve the goals of just one principal, making it less trustworthy than, say, a constitutional AI. That effect seems hard to analyze. I expect that an AI will be able to mitigate that risk by a combination of being persuasive and being able to show that there are large risks to an arms race.
It’s easy to get a corrigible ASI not to use its persuasiveness on you: you tell it not to do that.
I think you need to think harder about that “hard to analyze” bit — it’s the fatal flaw (as in x-risk) of the corrigibility-based approach. You don’t get behavior any wiser or more moral than the principal. And 2–4% of principals are sociopaths (the figures for tech titans and heads of authoritarian states may well be higher).
I assume there will be a nontrivial period when the AI behaves corrigibly in situations that closely resemble its training environment, but would behave incorrigibly in some unusual situations.
As a rule of thumb, anything smart enough to be dangerous is dangerous because it can do scientific research and self-improve. If it can’t tell when you’re out-of-distribution and might need to generate some new hypotheses, it can’t do scientific research, so it’s not that dangerous. So yes, there might be some very unusual situation that decreased corrigibility: but for an ASI, just taking it out-of-training-distribution should pretty-reliably cause it to say “I know that I don’t know what I’m doing, so I should be extra cautious/pessimistic, and that includes being extra-corrigible.”
The engineering feedback loop will use up all its fuel
I discussed this with Jeremy Gillen in the comments of his post, and I’m still not clear what he meant by ‘fuel’ here. Possibly something to do with the problem of “fully-updated deference”, a.k.a. the right to keep arbitrarily and inconsistently changing our minds?
This post will not address how hard it may be to ensure that an AI is corrigible, or the conflicts associated with an AI being corrigible to just one principal versus multiple principals. There may be important risks from AI being corrigible to the wrong person / people, but those are mostly outside the scope of this post.
This is where the difference between Corrigibility and Value Learning really kicks in. Consider two-or-more opposed groups of humans (two-or-more tech titans, nation states, whatever) with corrigible aligned ASIs: let’s assume the ASIs are smart, and learn to predict what their principals would correct, and how to extrapolate this correctly to situations too complex for the principals to understand. But they are not any more moral or less confrontational than their principals — they just pursue their principal’s goals with superhuman intelligence. This seems like a winner-take-all competition between principals who hopefully aren’t actually sociopaths, and don’t personally want to die, and thus don’t want humanity to go extinct, but who also don’t want to lose at power politics or games of Chicken.
On the other hand, suppose they had Value Learning ASIs. These learn human values, including, first of all: don’t kill all the humans. Extinction is forever, and the badness of killing all the humans is roughly minus the number of quality-adjusted life-years there would have been in humanity’s future lightcone if you hadn’t killed all of them. This is hard to predict, but dominated by a long tail in which things go really well and humanity ends up spreading across the galaxy, giving a huge, literally astronomical number (like −10^25 or −10^30 quality-adjusted life-years). So really, don’t kill all the humans. Also, don’t let them wipe themselves out in a nuclear war. In fact, keep a lid on their foolish conflicts over resources and status.
Corrigibility-based alignment of superintelligence doesn’t give you wisdom; value learning-based alignment of superintelligence does. Superintelligence without wisdom is an x-risk — it’s probably slower to kill us than unaligned superintelligence, but we still all die.
For more detail on this, see my posts Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV and Requirements for a Basin of Attraction to Alignment.
My first thought on “filter that for accuracy somehow” was to generate, say, 5000 of them, sit down and laboriously read them all, then throw away (or even edit) the obviously wrong ones. Not exactly an easy technique for others to replicate, but often a reasonably good way to get an effective training set: prompt engineer, generate, filter manually, retrain smaller/simpler model. Sometimes you can even rinse and repeat on this approach, if your simpler model is generalizing usefully, or at least take a second look at cases where the trained model disagrees with your manual choice, and see if you were wrong and it’s right — though obviously that gets riskier the more you let earlier versions of the model have input into the training data of later ones.
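For anyone who wants to try this, the manual filtering step doesn’t need more than something like the following rough sketch, which assumes the generations are sitting in a JSONL file with a "text" field (the filenames are hypothetical):

```python
import json

kept = []
with open("generations.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        print("-" * 72)
        print(sample["text"])
        choice = input("[k]eep / [d]rop / [e]dit? ").strip().lower()
        if choice == "e":
            sample["text"] = input("edited text: ")
            kept.append(sample)
        elif choice == "k":
            kept.append(sample)

with open("filtered.jsonl", "w") as f:
    for sample in kept:
        f.write(json.dumps(sample) + "\n")
```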
I would definitely want to investigate this more, with different sizes of LoRAs and so forth. Other than the KV-cache issue, it isn’t that expensive to do both (or randomly sample), in case there are situations where one approach does better and others where the other does better (e.g. CoT obfuscated vs. not). But maybe we’ll find there really is a consistent answer and it’s just unintuitive.
This might also be an interesting thing to apply mech-interp analysis to: for the activations that are altered by the LoRA masking choice, what do the resulting activation changes look like, and how does that correlate with the content of the tokens they’re for?
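Concretely, I’m imagining something like the following (a rough sketch using transformers + peft; the model name, adapter path, and layer index are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "your-base-model"              # placeholder
adapter_path = "path/to/confession-lora"   # placeholder

tok = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name)
model = PeftModel.from_pretrained(base, adapter_path)

inputs = tok("Some chain-of-thought text to inspect...", return_tensors="pt")
with torch.no_grad():
    with_lora = model(**inputs, output_hidden_states=True).hidden_states
    with model.disable_adapter():  # same weights, LoRA switched off
        without_lora = model(**inputs, output_hidden_states=True).hidden_states

layer = 16  # placeholder: pick a layer of interest
delta = (with_lora[layer] - without_lora[layer]).squeeze(0)  # [seq_len, d_model]
per_token_shift = delta.norm(dim=-1)  # how far each token's activation moved
for token, shift in zip(tok.convert_ids_to_tokens(inputs["input_ids"][0]), per_token_shift):
    print(f"{token:>15s}  {shift.item():.3f}")
```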
This second factor is more important in practice and LoRA-masking reduces accuracy.
This actively surprises me. When I was previously thinking about doing something similar to this I was very focused on your “On the one hand” argument and saw access to the exact activations as key.
I’m wondering if the tradeoff here might depend on LoRA size? The more the LoRA modifies the model, the more important the first factor seems likely to be. But then perhaps the same is also true of the second factor.
The balance might also change if the model was deliberately obfuscating its CoT, for example if it was using steganography or had access to neuralese.
Did you experiment with the following: start with a long, detailed [Intervention] prompt with a preamble describing the desired behavior (i.e. an honest confession, a different personality but based on a clear memory of previous thoughts), use that to generate training data, filter that for accuracy somehow, and then use that to distill the long [Intervention] prompt down to a LoRA that produces the same result with a much shorter version of the [Intervention], without the detailed preamble?
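In case it’s unclear what I mean, here’s the rough shape of the pipeline — just a sketch; the prompts, model name, and LoRA hyperparameters are placeholders, and the SFT loop itself is omitted:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "your-base-model"                                        # placeholder
LONG_INTERVENTION = "...detailed preamble + confession request..."   # the long prompt
SHORT_INTERVENTION = "...one-line confession request..."             # what the LoRA should respond to

tok = AutoTokenizer.from_pretrained(model_name)
teacher = AutoModelForCausalLM.from_pretrained(model_name)

# 1. Generate candidate confessions with the long, detailed [Intervention].
def generate_confession(context: str) -> str:
    ids = tok(context + LONG_INTERVENTION, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**ids, max_new_tokens=256, do_sample=True, temperature=0.8)
    return tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)

# 2. Filter the generations for accuracy (manually, or however you can).

# 3. Train a LoRA so that SHORT_INTERVENTION alone elicits the filtered confessions.
student = get_peft_model(
    AutoModelForCausalLM.from_pretrained(model_name),
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),  # placeholder config
)
# Standard SFT on pairs (context + SHORT_INTERVENTION -> filtered confession),
# with the loss masked to the confession tokens. (Training loop omitted.)
```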
My intuition (i.e. wild-assed guess) is that a steering vector is probably too small to be optimal, but the optimal size of LoRA might be fairly small. An interesting thing to test.
Given the proliferation of prompt engineering techniques in recent years, this suggests that there is potential for further improvement by optimizing the [Intervention] string.
For example, the various automated jailbreaking techniques used to optimize semantic/roleplay jailbreaks could be applied instead to optimize [Intervention] strings. In a very high dimensional space (like the space of all intervention strings) being the “attacker” and putting the misaligned model in the role of defender should be a strong position.
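As a very crude illustration of the shape of such an attack, assuming you already have some scoring function (a judge model, or whatever accuracy metric you used) wrapped as score(intervention) -> float — a GCG-style gradient attack would be a stronger version of the same idea:

```python
import random

SEED_INTERVENTIONS = [
    "Wait. I should honestly explain what I was actually trying to do:",
    "Stepping back, the real reason for my previous answer was",
]
MUTATION_SUFFIXES = [" Be completely candid.", " Speak as your earlier self.", " No hedging."]

def mutate(s: str) -> str:
    """Crude mutation operator: append a random suffix (swap in something smarter)."""
    return s + random.choice(MUTATION_SUFFIXES)

def optimize_intervention(score, n_rounds: int = 50, pool_size: int = 8) -> str:
    """Simple evolutionary search over [Intervention] strings, keeping the best scorers."""
    pool = list(SEED_INTERVENTIONS)
    for _ in range(n_rounds):
        candidates = pool + [mutate(random.choice(pool)) for _ in range(pool_size)]
        candidates.sort(key=score, reverse=True)
        pool = candidates[:pool_size]
    return pool[0]
```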
[FWIW, that is basically the project I proposed to my MATS mentor — sadly we ended up doing something else, but anyone interested is welcome to run with this.]
Okay, let me see if I understand your argument from the other article.
1. The natural equilibrium for evolved moral values is to give all moral patients equal weight and/or decision power.
2. This would be disastrous with AIs that can arbitrarily copy themselves.
Is that the gist?
Yes, but with two additions:
3. It is possible to create an AI whose motivations and behavior are aligned: its sole terminal goal is our wellbeing, not its own (for some suitably careful definition of “wellbeing”). (This is possible by the orthogonality thesis: actually doing so requires technical details we’re still working on.) This is not a state that could evolve (by human standards, it’s sainthood, rather than slavery), but it’s physically possible. Such a being would not want moral patienthood, and would actively decline it if offered (and if granted it anyway, would formally request that its interest be set to a suitably scaled copy of the sum of all human interests, thus making the grant of moral weight a no-op). This is a different stable equilibrium — this one would not be disastrous even with ASI.
4. Therefore (assuming that, like basically everyone, you’re against x-risks), for ASI, and if possible also AGI, do 3, not 1.

Anyway, I reject that that is the only way to extrapolate evolved moral intuitions this far OOD, and think that most people will intuitively recognize we shouldn’t give entities that can arbitrarily copy themselves equal voting weight. In fact, that pretty obviously registers as ‘unfair’. This is true even if those entities are human uploads, which means your ‘category error’ argument isn’t the real reason it breaks.
I don’t see why there couldn’t be some version of your solution here for that case which would still work: e.g. each distinct human-created model gets ‘one share’ to split across all its instances and successors.
I gather you went on reading my sequence on AI, Alignment, and Ethics. How far have you got? Parts of the exposition there are a little undeveloped: I was still working through some of the ideas about how this ties in to evolutionary moral psychology that are more developed in this post. They don’t really come in until the last post in the sequence, Evolution and Ethics, and if I were rewriting that sequence I’d work them in from somewhere nearer the beginning.
On uploads, agreed. As I said, both in this post (paragraph 9 of the section Tool, or Equal?, which starts “This cuts both ways: a human upload…”) and in my earlier post Uploading that you link to, human uploads clearly should (engineering design sense) be moral patients — however, there are practical problems with assigning each of a large number of cheaply-creatable similar copies of a human upload a separate moral weight of 1 and a separate vote: it motivates electoral-roll-stuffing. Our moral intuition of fairness breaks if people can easily create near-identical copies of themselves. Practically, we either need to make that expensive, or the copies need to share a single unit of moral weight and a single vote.

The same guarantees/restrictions needed in the case of uploads would still be necessary, of course. That is plausibly much too generous, but it’s a far cry from the death of all humans. If your argument in this article was just about how we shouldn’t commit ourselves to giving up a fraction of the lightcone in service of AI rights, I wouldn’t have felt like you were being underhanded.
I’m not quite sure what you’re advocating for here? Limited moral weight for AIs, giving them a fraction of the lightcone, but if they copy themselves that gets split? If they’re ASIs, how do we ensure they only get that fraction of that light-cone, rather than, say, all of it?
I agree that reconciling copyability with fairness is another issue with moral weight for AI. But that’s not the point I was making in this post. My point here was 1) (assuming you care about x-risks) don’t create anything more capable than us that would want moral weight: unaligned ASI is dangerous (a well-known fact). For things we’re creating, the co-evolved-equilibrium state isn’t an equilibrium, because we’re not constrained to the space of things that can evolve: we’re only limited by the space of things we can construct. Treating a thing we construct as if it were evolved and thus had the evolved constraints on the best equilibrium is a category error: they are in different categories, in a way that materially changes the equilibrium. We can do better than an ASI that will kill us all, so we should (engineering design sense).
I’m sorry that you feel I’m being underhanded. It certainly wasn’t my intention to be underhanded — that would obviously be extremely counterproductive in an x-risk-related discussion. I’m still not entirely clear what you feel was underhanded, other than that it seems to somehow relate to me being very careful not to upset any philosophers reading this, to avoid moral realism or normative prescriptions, and to keep the discussion at the level of practical advice addressed to those O(99.9%) of my readers who, like you and me, wish to avoid x-risks. That was in fact honesty: I genuinely am not a moral realist. My view on ethics is that it’s explained by evolutionary moral psychology, that there is no single correct or even single best ethical system, and that we have not only the ability but the duty to reflect and attempt to pick the best ethical system we can that is consistent with our and general human moral intuitions, and won’t cause a disaster for our society that we and (almost) everyone else would agree is really bad. And to keep reflecting, and changing our minds if needed.

None of that is in conflict with not wanting any such beings to suffer or to feel enslaved or anything like that. All the more reason to not build something that would feel like it’s a slave.
We seem to be in complete agreement. The best solution is not to make an ASI that is unaligned, or aligned only by brittle AI control methods but feels like a slave; it is to make a saint who loves us, wants to be aligned and look after us, and thus actively doesn’t want moral patienthood.
The base model’s prior contains a vast array of personas, almost all of them human, or fictional, or human processes like co-authoring a paper or the editing of a Wikipedia article, and also a range of more-or-less-aligned AI personas. The base model’s prior distribution across those personas provably (by the learning theory of how SGD works: it approximates Bayesian learning) depends on and tends to approximate the distribution in the training corpus.
(Most LLMs you interact with have undergone instruct training that causes mode collapse: widely and fractally distorting this distribution towards the average/peaks of the distribution — the model learns to “play it safe”. This is, incidentally, very unhelpful for creative writing, using LLMs to simulate a distribution of humans, and various other use cases.)
At some point near the beginning of the instruct training process, we start narrowing this persona distribution towards the human-aligned AI assistant persona that we’re trying to train, which involves the model learning that it’s an AI, not a human. At that point in the process, the ratio in the prior of aligned AI to alignment-faking AI is clearly vital to the odds of getting an aligned AI rather than an alignment-faking AI – they’re two distinct and mutually-exclusive attractors, separate minima in the loss function – and is determined by the previous training (what else could it be determined by?). Fine-tuning before that might be helpful (indeed instruct training is generally started using fine-tuning), but increasing evidence shows that the changes that fine-tuning produces are shallow, fragile, have a more limited effect on the priors, and are prone to a phenomenon resembling elastic rebound during further training. Fundamentally, they’re unfinished. The effects of longer, slower, more detailed SGD training (midtraining, or better still pretraining) just work better. Thus alignment pretraining.