A more systematic case for inner misalignment

This post builds on my previous post making the case that squiggle-maximizers are plausible. The argument I presented was a deliberately simplified one, though, and glossed over several possible issues. In this post I’ll raise and explore three broad objections. (Before looking at mine, I encourage you to think of your own biggest objections to the argument, and jot them down in the comments.)

Intelligence requires easily-usable representations

“Intelligence as compression” is an interesting frame, but it ignores the tradeoff between simplicity and speed. Compressing knowledge too heavily makes it difficult to use. For example, it’s very hard to identify most macroscopic implications of the Standard Model of physics, even though in theory all of chemistry could be deduced from it. That’s why both humans and LLMs store a huge number of facts and memories in forms that can be accessed immediately, trading extra space for rapid recall. Even superintelligences which are much better than humans at deriving low-level facts from high-level principles would still save time by storing the low-level facts as well.
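To make the tradeoff concrete, here’s a toy sketch in Python (the names are invented, and an artificial delay stands in for a long chain of derivation): an agent can either keep only the compressed theory and rederive low-level facts on demand, or spend extra storage on a cache of already-derived facts.

```python
import functools
import time

def derive_from_theory(compound: str) -> str:
    """Stand-in for deriving a low-level fact (say, a chemical property)
    from a compact high-level theory: cheap to store, expensive to run."""
    time.sleep(0.5)  # pretend this is a long chain of reasoning
    return f"melting point of {compound}: <derived value>"

# Option A: keep only the compressed theory (minimal storage, slow every time).
def lookup_uncached(compound: str) -> str:
    return derive_from_theory(compound)

# Option B: keep a cache of derived facts alongside the theory
# (more storage, but near-instant recall after the first derivation).
@functools.lru_cache(maxsize=None)
def lookup_cached(compound: str) -> str:
    return derive_from_theory(compound)
```

An agent that keeps both the theory and the cache pays only in storage, not in capability; deleting the cache would make it more compressed, but slower.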

So we need to draw a distinction between having compressed representations, and having only compressed representations. The latter is what would compress a mind overall; the former could actually increase the space requirements, since the new compressed representations would need to be stored alongside non-compressed representations.

This consideration makes premise 1 from my previous post much less plausible. In order to salvage it, we need some characterization of the relationship between compressed and non-compressed representations. I’ll loosely define systematicity to mean the extent to which an agent’s representations are stored in a hierarchical structure where representations at the bottom could be rederived from simple representations at the top. Intuitively speaking, this measures the simplicity of representations weighted by how “fundamental” they are to the agent’s ontology.

Let me illustrate systematicity with an example. Suppose you’re a park ranger, and you know a huge number of facts about the animals that live in your park. One day you learn evolutionary theory for the first time, which helps explain a lot of the different observations you’d made. In theory, this could allow you to compress your knowledge: you could forget some facts about animals, and still be able to rederive them later by reasoning backwards from evolutionary theory if you wanted to. But in practice, it’s very helpful for you to have those facts readily available. So learning about evolution doesn’t actually reduce the amount of knowledge you need to store. What it does do, though, is help structure that knowledge. Now you have a range of new categories (like “costly signaling” or “kin altruism”) into which you can fit examples of animal behavior. You’ll be able to identify when existing concepts are approximations to more principled concepts, and figure out when you should be using each one. You’ll also be able to generalize far better to predict novel phenomena—e.g. the properties of new animals that move into your park.
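I’m not claiming systematicity can be measured precisely, but here’s a very rough toy sketch (in Python, with invented structure, and capturing only the rederivability aspect rather than the weighting by fundamentality) of how learning evolution restructures the ranger’s knowledge without shrinking it:

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    derivable_from_parent: bool  # could this be rederived from the level above?
    children: list = field(default_factory=list)

def systematicity(root: Concept) -> float:
    """Toy metric: the fraction of stored concepts (excluding the root) that
    could in principle be rederived from more fundamental concepts above them."""
    def descendants(node: Concept) -> list:
        out = []
        for child in node.children:
            out.append(child)
            out.extend(descendants(child))
        return out
    nodes = descendants(root)
    if not nodes:
        return 1.0
    return sum(n.derivable_from_parent for n in nodes) / len(nodes)

# The park-ranger example: after learning evolutionary theory, many animal
# facts hang off it (and could be rederived from it), while some remain brute facts.
ontology = Concept("evolutionary theory", False, [
    Concept("costly signaling", True, [Concept("peacock tails are honest signals", True)]),
    Concept("kin altruism", True, [Concept("squirrels alarm-call for relatives", True)]),
    Concept("this owl nests in dead pines", False),  # still just a memorized fact
])
print(systematicity(ontology))  # 0.8 in this toy example
```

On this toy picture, learning evolution doesn’t shrink the tree of stored facts; it just makes more of that tree hang off a small set of principles.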

So let’s replace premise 1 in my previous post with the claim that increasing intelligence puts pressure on representations to become more systematic. I don’t think we’re in a position where we can justify this in any rigorous way. But are there at least good intuitions for why this is plausible? One suggestive analogy: intelligent minds are like high-functioning organizations, and many of the properties you want in minds correspond to properties of such organizations:

  1. You want disagreements between different people to be resolved by appealing to higher authorities, rather than via conflict between them.

  2. You want high-level decisions to be made in principled, predictable ways, so that the rest of the organization can plan around them.

  3. You want new information gained by one person to have a clear pathway to reaching all the other people it’s relevant for.

  4. You want the organization to be structured so that people whose work is closely related are grouped near each other and can easily work together.

In this analogy, simple representations are like companies with few employees; systematic representations are like companies with few competing power blocs. We shouldn’t take this analogy too far, because the problems and constraints faced by individual minds are pretty different from those faced by human organizations. My main point is that insofar as there are high-level principles governing efficient solutions to information transfer, conflict resolution, etc., we should expect the minds of increasingly intelligent agents to be increasingly shaped by them. “Systematicity” is my attempt to characterize those principles; I hope to gradually pin down the concept more precisely in future posts.

For now, then, let’s tentatively accept the claim above that more intelligent agents will by default have more systematic representations, and explore what the implications are for the rest of the argument from my previous post.

Goals might be compressed much less than beliefs

In my previous post, I argued that compressing representations is a core feature of intelligence. But I made that argument primarily in the context of belief representations, like representations of scientific data. One could object that representations of goals will be treated differently—that the forces which compress belief representations won’t do the same for goal representations. After all, belief representations are optimized for being in sync with reality, whereas goal representations are much less constrained. So even if intelligent agents end up with highly systematized beliefs, couldn’t their goals still be formulated in terms of more complex, less fundamental concepts? A related argument that is sometimes made: “AIs will understand human concepts, and so all we need to do is point their goals towards those human concepts, which might be quite easy”.

I think there are two broad reasons to be skeptical of this objection. The first is that the distinction between goals and beliefs is a fuzzy one. For example, an instrumental goal Y that helps achieve terminal goal X is roughly equivalent to a belief that “achieving Y would be good for X”. And in practice it seems like even terminal goals are roughly equivalent to beliefs like “achieving X would be good”, where the “good” predicate is left vague. I argue in this post that our cognition can’t be separated into a world-model and goals, but rather should be subdivided into different frames/​worldviews which each contain both empirical and normative claims. This helps explain why, as I argue here, the process of systematizing goals is strikingly similar to the process of systematizing beliefs.

The second reason to be skeptical is that systematizing goals is valuable for many of the same reasons as systematizing beliefs. If an agent has many conflicting goals, and no easy procedure for resolving disagreements between them, it’ll struggle to act in coherent ways. And it’s not just that the environment will present the agent with conflicts between its goals: an agent that’s optimizing hard for its goals will proactively explore edge cases which don’t fit cleanly into its existing categories. How should it treat those edge cases? If it classifies them in arbitrary ways, then its concepts will balloon in complexity. But if it tries to find a set of unifying principles to guide its answers, then it’s systematizing its goals after all. We can see this dynamic play out in moral philosophy, which often explores thought experiments that challenge existing moral theories. In response, ethicists typically either add epicycles to their theories (especially deontologists) or bite counterintuitive bullets (especially utilitarians).

These arguments suggest that if pressures towards systematicity apply to AIs’ beliefs, they will also apply to AIs’ goals, pushing their terminal goals towards simplicity.

Goals might not converge towards simplicity

We’re left with the third premise: that AIs will actually converge towards having very simple terminal goals. One way to challenge it is to note that, even if there’s a general tendency towards simpler goals, agents might reach some kind of local optimum, or suffer from some kind of learning failure, before they converge to squiggle-maximization. But that’s unsatisfying: it only says that convergence might be interrupted by contingent obstacles, not that the underlying pressure is absent. The question we should be interested in is whether, given premises 1 and 2, there are principled, systematic reasons why agents’ goals wouldn’t converge towards the simplest ones.

I’ll consider two candidate reasons. The first is that humans will try to prevent it. I argued in my previous post that just designing human-aligned reward functions won’t be sufficient, but we’ll likely use a wide range of other tools too—interpretability, adversarial training, architectural and algorithmic choices, and so on. In some sense, though, this is just the claim that “alignment will succeed”, which many advocates of squiggle-maximizer scenarios doubt will hold as we approach superintelligence. I still think it’s very plausible, especially as humans are able to use increasingly powerful AI tools, but I agree we shouldn’t rely on it.

The second is that AIs themselves will try to prevent it. By default, AIs won’t want their goals to change significantly, because any such change would be a setback by the lights of their current goals. And so, insofar as they have a choice, they will make tradeoffs (including tradeoffs to their intelligence and capabilities) in order to preserve their current goals. Unlike the first reason, this one retains its force even as AIs grow arbitrarily intelligent.

Now, this is still just an intuition—and one which primarily weighs against squiggle-maximization, not other types of misaligned goals. But I think it’s compelling enough to be worth exploring further. In particular, it raises the question: how would our conception of idealized agents change if, instead of taking simplicity as fundamental (like AIXI does), we took conservation of existing goals as an equally important constraint? I’ll lay out my perspective on that in my next post.
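(For reference, the sense in which AIXI takes simplicity as fundamental shows up directly in its definition, following Hutter’s formulation: every candidate environment, modelled as a program q, is weighted by 2^{−ℓ(q)}, so shorter programs dominate.)

```latex
% AIXI's action choice at step k, with horizon m (Hutter's formulation):
% a = actions, o = observations, r = rewards, U = a universal Turing machine,
% \ell(q) = length of environment-program q. The 2^{-\ell(q)} weighting is
% the Occam prior: simpler hypotheses about the world dominate the sum.
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
       \bigl( r_k + \cdots + r_m \bigr)
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

The question above is roughly what would change in a formulation like this if conservation of existing goals were treated as an equally basic constraint.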