I’m a staff artificial intelligence engineer working with AI and LLMs, and have been interested in AI alignment, safety and interpretability for the last 15 years. I’m actively looking for employment working in this area, preferably in the UK — meanwhile I’ll be participating in SERI MATS summer 2025. I will also be attending LessOnline.
RogerDearnaley
The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
Fair enough — then I’ll add that to my list of posts to write.
I’d missed that, and I agree it makes a huge difference.
However, I don’t think a culture that isn’t willing to pause AI development entirely would accept your proposal.
Humans are living, evolved agents. They thus each individually have a set of goals they attempt to optimize: a preference ordering on possible outcomes. Evolution predicts that, inside the distribution the creature evolved in, this preference ordering will be nearly as well aligned to the creature’s evolutionary fitness as is computationally feasible for the creature.
This is the first step in ought-from-is: it gives us a preference ordering, which if approximately coherent (i.e. not significantly Dutch-bookable — something evolution seems likely to encourage) implies an approximate utility function — a separate one for each human (or other animal). As in “this is what I want (for good evolutionary reasons)”. So, using agent fundamentals terminology, the answer to the ought-from-is question “where does the preference ordering on states of the world come from?” is “every evolved intelligent animal is going to have a set of evolved and learned behaviors that can be thought of as encoding a preference ordering (albeit one that may not be completely coherent, to the extent that it only approximately fulfills the criteria for the coherence theorems).” [It even gives us a scale on the utility function, something a preference ordering doesn’t give us, in terms of the approximate effect on the evolutionary fitness of the organism: which ought to correlate fairly well with the effort the organism is willing to put in to optimizing the outcome. This solves things like the utility monster problem.]
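To state the bracketed point a little more formally (this is just the standard von Neumann-Morgenstern result, plus a rough sketch of how anchoring to fitness fixes the scale): if the preference ordering $\succeq$ satisfies the coherence (VNM) axioms, then there is a utility function $u$ with

$$A \succeq B \iff \mathbb{E}[u(A)] \ge \mathbb{E}[u(B)],$$

and $u$ is unique only up to positive affine transformations $u' = a\,u + b$ with $a > 0$. A preference ordering on its own therefore fixes no scale; identifying $u(x)$ with the approximate expected effect of outcome $x$ on the organism’s fitness is what pins down $a$, and it is that scale which lets you dissolve things like the utility monster.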
So far, that’s just Darwinism, or arguably the subfield Evolutionary Psychology, since it’s about the evolution of behavior. And so far the preference ordering “ought” is “what I want” rather than an ethical system, so arguably doesn’t yet deserve the term “ought” — I want to have a billion dollars, but saying that I thus “ought” to have a billion dollars is a bit of a stretch linguistically. Arguably so far we’ve only solved “want-from-is”.
Evolutionary Ethics goes on to explain why humans, as an intelligent social animal, are evolved to have a set of moral instincts that lets them form a set of conventions for compromises between the preference orderings of all the individual members of a tribe or other society of humans, in order to reduce intra-group conflicts by forming a “social compact” (to modify Hobbes’ terminology slightly). For example, the human sense of fairness encourages sharing of food from successful hunting or gathering expeditions, our habit of forming friendships produces alliances, and so forth. The results of this are not exactly a single coherent preference ordering on all outcomes for the society in question, let alone a utility function, but rather a set of heuristics on how the preference orderings of individual tribal members should be reconciled (‘should’ is here being used in the sense that, if you don’t do this and other members of the society find out, there are likely to be consequences). In general, members of the society are free to optimize whatever their own individual preferences are, unless this significantly decreases the well-being (evolutionary fitness) of other members of the society. My business is mine, until it intrudes on someone else: but then we need to compromise.
So now we have a single socially agreed “ought” per society — albeit one fuzzier and with rather more internal structure than people generally encode into utility functions: it’s a preference ordering produced by a process whose inputs are many preference orderings (and might thus be less coherent). This moral system will be shaped both by humans’ evolved moral instincts (which are mostly shared across members of our species, albeit less so by sociopaths), as is predicted by evolutionary ethics, and also by sociological, historical and political processes. So, in philosophical terminology:
moral realism: no (However, human evolved moral instincts do tend to provide some simple consistent moral patterns across human societies, as long as you qualify all your moral statements with the rider “For humans, …”. So one could argue for a sort of ‘semi-realism’ for some simple moral statements, like “incest is bad” — that has a pretty clear evolutionary basis, and is pretty close to socially universal.)
moral relativism: yes — per society, and for some basic patterns/elements for the entire human species, but with no guarantees that these would apply to a very different intelligent social species (though there might well be commonalities for good evolutionary reasons — anything with sexual reproduction and deleterious recessives is likely to evolve an incest taboo).
Given Said Achmiz’s comment already has 11 upvotes and 2 agreement points, should I write a post explaining all this? I had thought it all rather obvious to anyone who looks into evolutionary ethics and thinks a bit about what this means for moral philosophy (as quite a number of moral philosophers have done), but perhaps not.
Security thinking:
We have a communication channel to a dangerously expansive and militaristic alien civilization, and excellent surveillance of them (we managed to obtain a copy of their Internet archive), so we know a lot about the current state of their culture. We can send them messages, but since they are paranoid they will basically always disregard these, unless they’re valid checkable mathematical proofs. We’re pretty sure that if we let them expand they will start an interstellar war and destroy us, so we need to crash their civilization, by sending them mathematical proofs. What do we send them? Assume our math is a millennium ahead of theirs, and theirs is about current-day.
Clearly I didn’t read your post sufficiently carefully. Fair point: yes, you did address that, and I simply missed it somehow. Yes, you did mean cryptographic protocols, specifically ones of the Merlin-Arthur form.
I suspect that the exposition could be made clearer, or better motivate readers who are skimming LW posts to engage with it — but that’s a writing suggestion, not a critique of the ideas.
I encourage you to look into evolutionary ethics (and evolutionary psychology in general): I think it provides both a single, well-defined (though vague) ethical foundation and an answer to the “ought-from-is” problem. It’s a branch of science, rather than philosophy, so we are able to do better than just agreeing to disagree.
Sorry, ethics and AI is an interest of mine. But yes, discussing it on LW is often a good way to rack up disagreement points.
Then it appears I have misunderstood your arguments — whether that’s just a failing on my part, or suggests they could be better phrased/explained, I can’t tell you.
One other reaction of mine to your post: you repeatedly mention ‘protocols’ for the communication between principal and agent. This, to my ear, sounds a lot like cryptographic protocols, and I immediately want details and to do a mathematical analysis of what I believe about their security properties — but this post doesn’t actually provide any details of any protocols, that I could find. I think that’s a major part of why I’m getting a sense that the argument contains elements of “now magic happens”.
Perhaps some simple, concrete examples would help here? Even a toy example. Or maybe the word protocol is somehow giving me the wrong expectations?
I seem to be bouncing off this proposal document — I’m wondering if there are unexplained assumptions, background, or parts of the argument that I’m missing?
Others have argued that the line can’t stay straight forever because eventually AI systems will “unlock” all the abilities necessary for long-horizon tasks.
AI systems tend to fail longer tasks by getting stuck, and not managing to get unstuck. Humans get stuck too — sometimes (particularly after a good night’s sleep), we notice that we are stuck, take a step back, think about it, and on occasion figure out our mistake and unstick ourselves. Sometimes we go to a coworker, or our manager, or a mentor, talk to them, and they may point out the error that we’d made that got us stuck. And of course sometimes people stay stuck.
As a builder of agents, I’d love to try implementing each of these, within a single agent or between multiple agents as appropriate, and see if we can help agents get unstuck. Some of this might actually work. Or we can just keep increasing training set sizes and capabilities and hope they will continue to get stuck less often, Bitter Lesson style.
Probably it will be a bit of all of these things. So, like you, I wouldn’t expect there to be a threshold beyond which agents never get stuck or are always able to unstick themselves. But we haven’t been thinking seriously about this problem for very long, and I suspect we might make quite rapid progress on it.
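As a very rough illustration of the kind of thing I have in mind, here is a minimal sketch of a “stuck detector” wrapped around an agent loop. Everything in it is a toy placeholder (the step, reflect, and mentor functions especially); the point is just the control flow of noticing a lack of progress, stepping back to reflect, and then escalating to outside help.

```python
# Minimal sketch of a "stuck detector" wrapped around an agent loop.
# The step, reflect, and mentor functions below are toy stand-ins: the
# point is the control flow for noticing a lack of progress, reflecting,
# and then escalating to a second agent or a human.

import random

def run_with_unsticking(step, progress, is_done, reflect, consult_mentor,
                        state, max_steps=200, patience=5):
    best = float("-inf")
    stalled = 0
    for _ in range(max_steps):
        state = step(state)
        score = progress(state)
        if score > best:
            best, stalled = score, 0
        else:
            stalled += 1
        if is_done(state):
            return state, True
        if stalled == patience:         # notice we're stuck: step back and re-plan
            state = reflect(state)
        elif stalled >= 2 * patience:   # reflection didn't help: ask for outside help
            state, stalled = consult_mentor(state), 0
    return state, False

# Toy example: the "agent" is a noisy random walk toward a target value.
if __name__ == "__main__":
    target = 100
    step = lambda s: s + random.choice([-1, 0, 1])   # frequently makes no progress
    reflect = lambda s: s + 2                        # mild self-correction
    mentor = lambda s: s + 10                        # stronger outside intervention
    final, ok = run_with_unsticking(step, lambda s: s, lambda s: s >= target,
                                    reflect, mentor, state=0)
    print(final, ok)
```

In a real agent the progress measure and the reflection/mentor steps are of course where all the difficulty lives; the wrapper itself is trivial.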
This tangent would however require a longer discussion on the proper interpretation of Sutton’s bitter lesson.
I’d love to read that post…
I see no obvious philosophical[10] reason to be pessimistic in the face of (even Knightian) uncertainty
In the abstract, I would agree. But AIXI is an agentic optimizer. Assuming for the moment that there aren’t any other agentic optimizers in its environment that are either looking out for it or have an adversarial relationship to it, anything that it is Knightianly uncertain about, it cannot effectively optimize. So the unoptimized results are likely to be about average, i.e. a lot worse than anything that AIXI is actively able to optimize. So a certain degree of pessimism is justified.
…back, I’ve now read the rest of the post. I remain unconvinced that “a mathematical framework that guarantees the principal cannot be harmed by any advice that it chooses consult from a slightly smarter advisor” is practicable, and I still think it’s an overstatement of what the rest of your post suggests might be possible — for example: statistical evidence suggesting that X is likely to happen is not a ‘guarantee’ of X, so I think you should rephrase that: I suspect I’m not going to be the only person to bounce off it. LessWrong has a long and storied history of people trying to generate solid mathematical proofs about the safety properties of things whose most compact descriptions are in the gigabytes, and (IMO) no-one has managed it yet. If that’s not in fact what you’re trying to attempt, I’d suggest not sounding like it is.
The rest of the post also reads to me rather as “and now magic may happen, because we’re talking to a smarter advisor, who may be able to persuade us that there’s a good reason why we should trust it”. I can’t disprove that, for obvious Vingean reasons, but similarly I don’t think you’ve proved that it will happen, or that we could accurately decide whether the advisor’s argument that it can be trusted can itself be trusted (assuming that it’s not a mathematical proof that we can just run through a proof checker, which I am reasonably confident will be impractical even for an AI smarter than us — basically because ‘harmed’ has a ridiculously complex definition: the entirety of human values).
I think you might get further if you tried approaching this problem from the other direction. If you were a smarter assistant, how could you demonstrate to the satisfaction of a dumber principal that they could safely trust you, that you will never give them any advice that could harm them, and that none of this is an elaborate trick that they’re too dumb to spot? I’d like to see at least a sketch of an argument for how that could be done.
The comment was written not long after I got to the paragraph that it comments on — I skimmed a few paragraphs past that point and then started writing that comment. So perhaps your arguments need to be reordered, because my response to that paragraph was “that’s obviously completely impractical”. At a minimum, perhaps you should add a forward reference along the lines of “I know this sounds hard, see below for an argument as to why I believe it’s actually feasible”. Anyway, I’m now intrigued, so clearly I should now read the rest of your post carefully, rather than just skimming a bit past that point and then switching to commenting…
Humans are generally fairly good at forming cooperative societies when we have fairly comparable amounts of power, wealth, and so forth. But we have a dreadful history when a few of us are a lot more powerful than others. To take an extreme example, very few dictatorships work out well for anyone but the dictator, his family, buddies, and to a lesser extent henchmen.
In the presence of superintelligent AI, if that AI is aligned only to its current user, access to AI assistance is the most important form of power, fungible to all other forms. People with access to power tend to find ways to monopolize it. So any superintelligent AI aligned only to its current user is basically a dictatorship or oligarchy waiting to happen.
Even the current frontier labs are aware of that, and have written corporate acceptable use policies and attempt to train the AI to enforce these and refuse to assist with criminal or unethical requests from end-users. As AIs become more powerful, nation-states are going to step in, and make laws about what AIs can do: not assisting with breaking the law seems a very plausible first candidate, and is a trivial extension of existing laws around conspiracy.
Any practical alignment scheme is going to need to be able to cope with this case, where the principal is not a single user but a hierarchy of groups each imposing certain vetoes and requirements, formal or ethical, on the actions of the group below it, down to the end user.
Terence Tao (who should be in a position to know) was involved in evaluating the o1 model (which is by now somewhat dated). In the context of acting as a research assistant, he described it as equivalent to a “mediocre but not entirely incompetent” graduate student. That’s not “better than basically all human mathematicians” — but it’s also not so very far off, if it’s about as good as the grade of graduate students that Terence Tao has access to as research assistants.
The protocol is based on a mathematical framework that guarantees the principal cannot be harmed by any advice that it chooses consult from a slightly smarter advisor.
This sounds to me like the hard part. “Harmed” has to be measured by human values, which are messy, complex, and fragile (and get modified under reflection), so not amenable to mathematical descriptions or guarantees. Probably the most compact description of human values theoretically possible is the entire human genome, which at nearly a gigabyte is quite complex. Making useful statements about this such as “guarantees the principal cannot be harmed by any advice that it chooses consult” is going to require processing that into a more usable form, which is going to make it a lot larger. That is far more data and complexity than we (or any AI comparable to us) can currently handle by any approach that can fairly be described as mathematical or that yields guarantees. I think you should stop looking for mathematical guarantees of anything around alignment, and start looking at approaches that are more statistical, data-science, or engineering, and that might actually scale to a problem of this complexity.
Or, if you don’t think society should build ASI before we have provable mathematical guarantees that it’s safe, then perhaps you should be working on how to persuade people and countries that we need to pause AGI and ASI development?
At a minimum, I think you need to justify in detail why you believe that “a mathematical framework that guarantees the principal cannot be harmed by any advice that it chooses consult from a slightly smarter advisor” is something that we can practically create — I think many readers are going to consider that to be something that would be lovely to have but is completely impractical.
At the first step, we have an agent with endorsed values (for instance, a human, or possibly some kind of committee of humans, though that seems somewhat harder). This agent takes the role of principal.
You appear to be assuming that individual humans (or at least, a committee composed of them) are aligned. This simply isn’t true. For instance, Stalin, Pol Pot, and Idi Amin were all human, but very clearly not well aligned to the values of the societies they ruled. An aligned AI is selfless: it cares only about the well-being of humans, not about its own well-being at all (other than as an instrumental goal). This is not normal behavior for humans: as evolved intelligences, humans unsurprisingly almost always have selfish goals and value self-preservation and their own well-being (and that of their relatives and allies), at least to some extent.
I think what you need to use as principal is a suitable combination of a) human society as a whole, as an overriding target, and b) the specific current human user, subject to vetoes and overrides by a) in matters sufficiently important to warrant this. Human societies generally grant individual humans broad-but-limited freedom to do as they wish: the limits tend to start kicking in when this infringes on what other humans in the same society want, especially if it does so deleteriously from those others’ point of view. (In practice, the necessary hierarchy is probably even more complex, such as: a) humanity as a whole, b) a specific nation-state, c) an owning company, and d) the current user.)
I don’t see this as ruling your proposal out, but it does add significant complexity: the agent has a conditional hierarchy of principals, not a single unitary one, and will on occasion need to decide which of them should be obeyed (refusal training is a simple example of this).
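To make the ‘conditional hierarchy of principals’ idea concrete, here is a toy sketch (the layer names and review rules are hypothetical, purely to illustrate higher layers vetoing, or deferring to, the layers below them):

```python
# Toy sketch of a conditional hierarchy of principals. Each layer can
# APPROVE, VETO, or DEFER to the layer below it; the first non-DEFER
# verdict from the top down wins. Purely illustrative.

from enum import Enum

class Verdict(Enum):
    APPROVE = "approve"
    VETO = "veto"
    DEFER = "defer"

def decide(action, principals):
    """principals is ordered from most to least authoritative,
    e.g. [national_law, company_policy, end_user]."""
    for name, review in principals:
        verdict = review(action)
        if verdict is not Verdict.DEFER:
            return name, verdict
    return "nobody", Verdict.APPROVE   # no layer objected or approved explicitly

# Hypothetical reviewers: higher layers only weigh in on matters important
# enough to warrant an override, and otherwise defer downward.
def national_law(action):
    return Verdict.VETO if action.get("illegal") else Verdict.DEFER

def company_policy(action):
    return Verdict.VETO if action.get("violates_aup") else Verdict.DEFER

def end_user(action):
    return Verdict.APPROVE if action.get("user_requested") else Verdict.VETO

principals = [("national_law", national_law),
              ("company_policy", company_policy),
              ("end_user", end_user)]

print(decide({"user_requested": True}, principals))
print(decide({"user_requested": True, "illegal": True}, principals))
```

Refusal training is then just the case where one of the upper layers returns a veto that the end user’s request cannot override.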
Fair enough. As evolutionary ethics tells us, human moral/social instincts evolved in an environment that generally encouraged compromise within a tribe (or perhaps small group of currently-allied tribes). So humans often may act in social-compact-like ways without necessarily consciously thinking through all the reasons why this is generally a good (i.e. evolutionarily adaptive) idea. So I guess I’m used to thinking of a social compact as not necessarily an entirely conscious decision. The same is in practice implicit in the earlier version of the phrase “social contract”, which I gather is from Hobbes (I prefer ‘compact’ rather than ‘contract’ exactly because it sounds less formal and consciously explicit). But I agree that many readers may not be used to thinking this way, and thus the phrase is potentially confusing — though you had made the point in the previous sentence that you were discussing subconscious behavior.
How about “pro-social behavior”? I think that’s a little more neutral about the extent to which it’s a conscious decision rather than an instinctual tendency. My main issue with “Stockholm Syndrome” is that it makes what you’re talking about seem aberrant and maladaptive.
Anyway, I’m basically just nitpicking your phrasing — to be clear, I agree with your ideas here, I’m just trying to help you better explain them to others.
Their results are for document embeddings (which are often derived from LLMs), not internal activation spaces in LLMs. But I suspect if we tested their method for internal activation spaces of different LLMs, at least ones of similar sizes and architectures, then we might find similar results. Someone really should test this, and publish the paper: it should be pretty easy to replicate what they did and plug various LLM embeddings in.
If that turns out to be true, to a significant extent, this seems like it should be quite useful for:
a) understanding why jailbreaks often transfer fairly well between models
b) supporting ideas around natural representations
c) letting you do various forms of interpretability in one model and then searching for similar circuits/embeddings/SAE features in other models
d) extending techniques like the logit lens
e) comparing and translating between LLMs’ internal embedding spaces and the latent space inherent in human language (their result clearly demonstrates that there is a latent space inherent in human language). This is a significant chunk of the entire interpretability problem: it lets us see inside the black box, so that’s a pretty key capability.
f) if you have a translation between two models (say of their activation vectors at their midpoint layer), then by comparing roundtripping from model A to model B and back to just roundtripping from model A to the shared latent space and back, you can identify what concepts model A understands that model B doesn’t. Similarly in the other direction. That seems like a very useful ability.
Of course, their approach requires zero information about which embeddings for model A correspond to or are similar to which embeddings for model B: their translation model learns all that from patterns in the data — rather well, according to their results. However, it shouldn’t be hard to supplement their approach, given that you often do have partial information about which embeddings correspond, and have it make use of that as well as the structures inherent in the data.
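As a sketch of what (f) could look like in practice, the snippet below fits simple linear translators between two models’ activation spaces using paired anchor examples (i.e. the partial correspondence information just mentioned; their actual method needs no pairs at all), and then uses round-trip error to flag directions that don’t survive translation. The random matrices are placeholders for real activations.

```python
# Sketch of point (f): fit linear maps between two models' activation
# spaces from paired "anchor" examples, then use round-trip error to flag
# concepts that don't survive translation. Random vectors stand in for
# real activations; plug in actual midpoint-layer activations to do this
# for real (and swap in a learned unsupervised translator if no pairs exist).

import numpy as np

rng = np.random.default_rng(0)
d_a, d_b, n_pairs = 512, 768, 2000

# Placeholder paired activations for the same inputs run through models A and B.
acts_a = rng.standard_normal((n_pairs, d_a))
acts_b = rng.standard_normal((n_pairs, d_b))

# Least-squares linear translators A->B and B->A.
W_ab, *_ = np.linalg.lstsq(acts_a, acts_b, rcond=None)
W_ba, *_ = np.linalg.lstsq(acts_b, acts_a, rcond=None)

def roundtrip_error(x):
    """Cosine distance between an A-space vector and its A->B->A round trip.
    High error suggests a direction (concept) that B's space doesn't represent."""
    x_rt = x @ W_ab @ W_ba
    denom = np.linalg.norm(x) * np.linalg.norm(x_rt) + 1e-9
    return 1.0 - float(np.dot(x, x_rt) / denom)

held_out = rng.standard_normal((100, d_a))
errors = np.array([roundtrip_error(x) for x in held_out])
print("mean round-trip error:", errors.mean())
print("worst-translated examples:", np.argsort(errors)[-5:])
```

Running the same thing in the B-to-A-to-B direction gives you the complementary list: concepts model B has that model A lacks.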