Another argument against maximizer-centric alignment paradigms

expectation calibrator: freeform draft, posting partly to practice lowering my own excessive standards.

So, a few months back, I finally got around to reading Nick Bostrom’s Superintelligence, a major player in the popularization of AI safety concerns. Lots of the argument was stuff I’d internalized a long time ago, reading the sequences and even living with AI alignment researchers for a few months. But I wanted to see the circa-2015 risk-case laid out as thoroughly as possible, partly as an excursion to learn more about my own subculture, and partly to see if I’d missed anything crucial to why all of us consider future advanced AI systems so dangerous.

But for the most part, the main effect of reading that book was actually to finally shatter a certain story I’d been telling myself about the AI doom scenario, the cracks in which had been starting to show for several months prior. Reading Superintelligence for myself made me notice just how much of AGI X-risk concern grew out of an extremely specific model of how advanced AI was actually going to work, one which had received very little supporting evidence (and even some countervailing evidence) since the rise of large language models.

Of course, I’m talking about how, when it comes to explaining why AI systems are risky, Bostrom’s book leans hard on expected utility maximization (like Yudkowsky’s sequences before it). Bostrom does step into discussions about other risks from time to time, for example pointing out wireheading as a potential failure mode of reinforcement learning systems (which are importantly not AIXI-style utility maximizers and pose an at least slightly different set of risks). However, most of the book’s discussion of AI risk frames the AI as having a certain set of goals from the moment it’s turned on, and ruthlessly pursuing those to the best of its ability.

It’d be easy to get sidetracked here with a debate about whether or not current AI systems are on-track to pose that exact kind of x-risk. Indeed, modern deep learning systems don’t have explicit utility functions; instead they’re trained with loss functions, which plausibly have quite different dynamics: a loss function shapes the model’s parameters during training, rather than being an objective the deployed system explicitly consults and maximizes at runtime.
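To make that contrast concrete, here’s a toy sketch (every name in it is a made-up stand-in, not anyone’s real architecture): an explicit expected-utility maximizer consults its utility function at runtime to choose actions, whereas a trained network only ever “sees” its loss as a training-time signal that nudges its parameters, and at inference time just computes a fixed function of its input.

```python
# Toy sketch only -- every name here is a made-up stand-in.

# An AIXI-flavored agent: at *runtime*, it searches over actions and possible
# worlds, and picks whatever maximizes expected utility under its utility function.
def pick_action(actions, worlds, p_world_given_action, utility):
    return max(
        actions,
        key=lambda a: sum(p_world_given_action(w, a) * utility(w) for w in worlds),
    )

# A deep learning model: the "objective" (a loss) only shows up during training,
# as a signal for nudging parameters. At inference time the model just computes
# a fixed function of its input -- nothing is being searched or maximized.
def sgd_step(w, grad_loss, lr=0.01):
    return w - lr * grad_loss(w)

# e.g. fitting a one-parameter model y = w * x to the single point (x=2, y=6):
grad = lambda w: 2 * (w * 2 - 6) * 2  # derivative of the squared error
w = 0.0
for _ in range(200):
    w = sgd_step(w, grad)
print(round(w, 2))  # ~3.0; the deployed "model" is then just y = 3 * x
```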

We could talk about MIRI’s Risks from Learned Optimization paper and its arguments that deep learning systems could plausibly develop algorithms that perform explicit optimization anyway. We could try to analyze that argument in more depth, or evaluate the empirical evidence for whether or not that’s actually going on in current deep learning systems. And I do have a lot of thoughts to share on precisely that topic. But that’s not what I want to focus on right now. Instead, I want to make a point about the genealogy of this debate. What Bostrom’s book made me realize was that so much of AI x-risk concern came from a time and place where it was seemingly vastly more difficult to even imagine what general intelligence might look like, if not some kind of utility maximizer.

Case-in-point: Bostrom’s discussion of x-risks from so-called tool AIs being used as software engineers. This is a particularly valuable example because modern AI systems, like Claude 3.5 Sonnet, are already quite damn good at software engineering, so we actually have a reality against which to compare some of Bostrom’s speculations.

Here’s the vision Bostrom lays out for how AGI-enhanced software engineering might look.

With advances in artificial intelligence, it would become possible for the programmer to offload more of the cognitive labor required to figure out how to accomplish a given task. In an extreme case, the programmer would simply specify a formal criterion of what counts as success and leave it to the AI to find a solution. To guide its search, the AI would use a set of powerful heuristics and other methods to discover structure in the space of possible solutions. It would keep searching until it found a solution that satisfied the success criterion. The AI would then either implement the solution itself or (in the case of an oracle) report the solution to the user.

Notice that Bostrom imagines having to specify a formal criterion for what counts as a solution to one’s programming problem. This seems like a clear relic of a culture in which explicit utility maximizers were by far the most conceivable form advanced AI systems could take, but that’s largely changed by now. You can actually just use natural language to outline the code you want a Claude-like AI system to write, and it will do so, with a remarkable intuitive knack for “doing what you mean” (in contrast to the extreme rigidity of computer programs from decades past).
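To see the contrast concretely, compare the two workflows sketched below. The names are hypothetical stand-ins, not any real search procedure or API; the point is just where the human’s intent lives in each case.

```python
# The Bostrom-era picture: hand the AI a formal success criterion and let it
# search program-space for anything that satisfies it.
def is_solution(sort_fn) -> bool:
    """Formal criterion: a candidate counts as a solution iff it passes these
    checks. Everything you care about has to be captured here, explicitly."""
    return sort_fn([3, 1, 2]) == [1, 2, 3] and sort_fn([]) == []

def search_program_space(success_criterion):
    """Stand-in for the hypothetical 'powerful search over solutions'. Here it
    just checks one candidate; the worry was about what an unboundedly clever
    search might surface that technically satisfies the criterion."""
    candidate = lambda xs: sorted(xs)
    return candidate if success_criterion(candidate) else None

solution = search_program_space(is_solution)

# What you actually do with a Claude-like model: describe what you want in
# plain English and rely on it to fill in the unstated intent.
prompt = "Write a Python function that sorts a list of ints; handle the empty list too."
# code = call_language_model(prompt)   # hypothetical wrapper around an LLM API
```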

And this isn’t trivial in terms of its implications for Bostrom’s threat models. It actually rules out language models causing the first kind of catastrophe Bostrom imagines as following from AGI software engineers. Here he is discussing the user-specified formal criterion as roughly a kind of utility function, which an AI software engineer might malignantly maximize.

There are (at least) two places where trouble could then arise. First, the superintelligent search process might find a solution that is not just unexpected but radically unintended. This could lead to a failure of one of the types discussed previously (“perverse instantiation,” “infrastructure profusion,” or “mind crime”). It is most obvious how this could happen in the case of a sovereign or a genie, which directly implements the solution it has found. If making molecular smiley faces or transforming the planet into paperclips is the first idea that the superintelligence discovers that meets the solution criterion, then smiley faces or paperclips we get.

Bostrom’s second threat model is that the AI would paperclip the world in the process of trying to produce a solution, for example to get rid of sources of interference, rather than as a side effect of whatever its solution actually is. This would also not come about as a result of a language model relentlessly striving to fulfill a user’s simple, natural-language request for a certain kind of computer program, contra Bostrom’s vision. If deep learning systems could be made to ruthlessly paperclip the universe, it would be as a result of some other failure-mode, like wireheading, mesa-optimization, or extremely malicious prompt engineering. It wouldn’t follow from the user mis-specifying an explicit utility function in the course of trying to use a language model normally.

The reason I’m bringing all of this up isn’t that I think it blows up the AI risk case entirely. It’s possible that current AI systems have their own, hidden “explicit goals” which will become dangerous eventually. And even if they don’t, it’s at least feasible that we might lose control of advanced AI systems for other reasons. However, the point I want to make is that many extremely high p(doom)s seem to have derived from a memeplex in which AGI that didn’t look like a von Neumann-rational paperclip maximizer was barely even conceivable, a condition which no longer holds and which should therefore correspond to weaker fears about human extinction. To paint a clearer picture of what I mean, let’s go over some reasons to think that current deep learning systems might work differently than utility maximizers.

Firstly, modern AI researchers have developed at least partial stories for what might be going on inside deep learning systems. For instance, take the concept of word embeddings. At the very start of a language model’s computation, each word of the input is turned into a unique vector, embedded in a high-dimensional space. It turns out that, after the model has been trained a bit, these embeddings encode the meaning of each word in a quite intelligible manner. For example, if you take the vector for “uncle,” subtract the vector for “male,” and then add the vector for “female,” you’ll end up with a vector very close to the one for “aunt.”
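Here’s a toy illustration of that arithmetic, with tiny made-up 4-dimensional vectors standing in for the learned, high-dimensional embeddings real models use:

```python
import numpy as np

# Toy 4-dimensional embeddings, invented for illustration -- real models learn
# vectors with hundreds or thousands of dimensions from data.
emb = {
    "uncle":  np.array([0.9, 0.1, 0.8, 0.0]),
    "aunt":   np.array([0.1, 0.9, 0.8, 0.0]),
    "male":   np.array([0.9, 0.1, 0.0, 0.1]),
    "female": np.array([0.1, 0.9, 0.0, 0.1]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "uncle" - "male" + "female" should land near "aunt".
query = emb["uncle"] - emb["male"] + emb["female"]
nearest = max(emb, key=lambda w: cosine(emb[w], query))
print(nearest)  # "aunt" (by construction, in this toy example)
```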

We also have some idea of what the later layers of a GPT-like neural network might be doing, as they have these word vectors mathematically dance with each other. Lots of those later layers are so-called “self-attention blocks”, which perform a mathematical operation designed to let each word-vector the model is processing “soak up” meaning from previous words in the input (as encoded in the numbers of their own word-vectors), for example to help the model tell what person a pronoun refers to, or to help it infer the extra, unique meaning that words like “Bill” and “Clinton” should have when they show up next to each other. The model does this kind of thing over and over again as it sequentially processes the text you fed into it, and it could plausibly account for much of its general intelligence, without invoking utility maximization.
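For the curious, here’s roughly what a single, stripped-down attention head computes, leaving out the multiple heads, layer norms, MLP blocks, and positional information of a real transformer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One attention head: each position mixes in information from earlier
    positions, weighted by how relevant they look. X is (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: a word may only "soak up" meaning from words before it.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ V

# Random toy weights: 5 tokens, model width 8, head width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 4)
```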

(By the way, for those who haven’t seen them, I highly recommend 3blue1brown’s recent videos on how transformer-based LLMs work. One of their strengths is that they give this kind of “why it might work” story for many details of the transformer architecture, more than I’m covering here.)

Lastly, some dedicated interpretability research into (very small) transformers has uncovered some basic strategies they use for next-word prediction. Anthropic’s paper on so-called induction heads is one example: they show that at one point in the training process, their very small transformer models learned to start checking if the current string they were completing showed up earlier in the context window, and assuming that string would be completed the same way again.[1] Anthropic has another paper arguing that a scaled-up, fuzzier version of this process may be a major strategy used by larger transformer models.
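As a caricature of that strategy in ordinary code (real induction heads implement a soft, learned version of this inside attention layers, not literal string matching), it looks something like this:

```python
def induction_guess(tokens):
    """If the latest token appeared earlier in the context, guess that whatever
    followed it then will follow it again now."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards for a match
        if tokens[i] == last:
            return tokens[i + 1]
    return None

print(induction_guess(["Bill", "Clinton", "met", "with", "Bill"]))  # "Clinton"
```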

And notably, like the other examples I’ve talked about here, this strategy does not look very much like AIXI-style “searching over worlds I might be in, and searching over actions I might take on each of those worlds to maximize expected value per my well-defined utility function.”

Again, I’m not bringing up these glimmers of interpretability in an effort to prove that current, advanced deep learning systems don’t implement something like mesa-optimization, or that no future deep learning system would do so, or that all future AI systems will be deep learning-based at all. There’s just not enough information to tell yet, and it’s very important to be careful, considering the disastrous consequences such systems could have if we accidentally deployed them.

No, the real reason I bring this up is to point out that in the 10 years since Superintelligence was published, and in the 20 years since MIRI was founded, it has become vastly more conceivable that a generally intelligent AI system might work by methods other than AIXI-like expected utility maximization. If AI safety research were just starting up today, the probabilities we’d naturally assign to advanced AI systems working like utility maximizers would probably be a lot lower than they were in SL4-ish spaces 20 years ago. We have acquired alternate, partial models of how advanced AI might work, so we don’t have to anchor so hard on AIXI-inspired models when making predictions (even if their extreme danger still means they merit a lot of attention). In short, we ought to try seeing the alignment problem with fresh eyes.

This was the position which emerged from the ashes of my old views while reading Superintelligence. More specifically, it was the position I’d been lying to myself to avoid seeing for months; its emergence felt like a newborn phoenix amid the necrotic tissue of my past view. I’d spent lots of self-deceptive energy trying to muster up an inner story of why language models were likely to be extremely dangerous, in order to justify the high p(doom) I’d picked up from the sequences and wanted to maintain, partly to avoid embarrassment and partly for fear of losing my ingroup status among rationalists if I became enough of an AI optimist.

(Hell, I was afraid of becoming someone who was adamant about merely ensuring AI safety despite already thinking it the likely outcome, e.g. someone who focuses on preventing the future from being eaten by other AI failure modes, such as deep learning systems choosing to wirehead and then tiling the universe with computro-hedonium.)[2]

The loss-of-ingroup-status didn’t end up happening, really, since I still have a bunch of values and interests in common with most rationalists. But that didn’t stop the process of learning to let myself lower my p(doom) from feeling a little bit like I was leaving a religion. It’s possible that other rationalists’ strong predictions of doom come from their having better evidence than me, or being better at interpreting it. But I worry that at least in some cases, there’s also some sunk-cost fallacy going on, or fear of lonely dissent, or some other biases like that. I think both of those particular ones were present in my case. If you can relate, I want to tell you that it’s okay to reconsider your beliefs.

(For the sake of completeness, I’ll note that there’s probably some wishful thinking and contrarianism powering my current outlook. But I think there’s a true, meaningful update in here too.)

Anyway, AI safety is important enough that I want to keep working on it, despite my relative optimism. With any luck, this update on my part might even make me a more motivated researcher, and perhaps a more convincing communicator of the fact that alignment poses some legitimate challenges.

Overall, this arc of mine has felt like a kind of shadow integration, a practice which is supposed to have its benefits.

  1. ^

    I’d like to note the rapid improvement this “insight” caused in the models’ performance. From the outside, one might see the loss curve going down very fast and think “oh god oh fuck it’s FOOMING, it has become a MESA-OPTIMIZER”. But Anthropic’s analysis gives a very concrete example of something else that might be going on there. This may be relevant for dampening future risk forecasts.

  2. ^

    I’m especially interested in safety research that treats language models as at least somewhat neuromorphic or brain-like. Here’s a whole essay I wrote on that topic; it explores analogies like predictive learning in ML and predictive processing in humans, RL in both types of systems, and consciousness as a type of context window.

    If deep learning systems are sufficiently neuromorphic, we might be able to import some insights from human alignment into AI safety research. For example, why don’t most humans like wireheading? If we figured that out, it might help us ensure deep learning systems strive to avoid doing so themselves.