If Wentworth is right about natural abstractions, it would be bad for alignment
This post was written as part of the AI Safety Mentors and Mentees program. My mentor is Jacques Thibodeau.
In this post, I will distinguish between two hypotheses that are often conflated. To disambiguate, I first suggest two different names for these hypotheses so I can talk about them separately:
The natural abstraction hypothesis (NAH):
There are natural ways to cut the world up into concepts. A wide range of very different cognitive systems will naturally converge on these abstractions. So there is reason to believe that AIs will also form the abstractions that humans use (nails, persons, human values, ...).
The Wentworthian abstractions hypothesis (WAH):
There are natural abstractions, and they consist of the properties of a system that are relevant for predicting how far-away objects behave.
Notice how the first might be true while the second is completely off, just as you can deny that Newtonian mechanics is true without denying that heavy objects attract each other.
Why natural abstractions are thought to be good for alignment
If NAH turns out to be correct, this would simplify two problems in alignment.
1. Interpretability
If the AI uses the same abstractions as us, it is probably way easier to read its mind.
2. Pointing at things
If the AI forms the abstraction "diamond" by itself, we could just point at that abstraction in the AI's mind and say "maximize that one", instead of trying to rigorously formulate what a diamond is. This was proposed, in combination with shard theory, as an approach to the diamond-alignment problem. If the AI naturally formed an abstraction of human values, alignment might be easier than we thought (alignment by default): we could point at that abstraction by training the AI in such a way that it adheres to those values.
Wentworthian abstractions are about outer appearance, not inner structure
Wentworth hypothesizes that natural abstractions consist of information that is relevant from afar. Let's take the example of a nail. A particular nail consists of billions of atoms, and there is an overwhelming number of facts that could be stated about these particles. However, when you see a nail on the other side of the room, you only tend to think about a few of these facts: its elongated shape, its color, and its pointiness, for example. Let's say this is the only abstract information about these billions of particles that makes you identify them as a nail (rather than the information about which particular atom is in which particular place).
This means that a certain kind of abstraction is barred from being a Wentworthian abstraction: inner structure. Because Wentworthian abstractions only consist of the information that is relevant far away, the specific way an object looks up close is a Wentworthian abstraction only insofar as it influences the properties that are relevant far away. For example, the nail consists of iron atoms and not gold. If the nail were made of gold, it would have a different color and weight. However, this exact color and weight are not unique to iron. Presumably, you could mix other metals in such a way that the color and weight would be identical. Then this new metal would share all Wentworthian abstractions with the iron nail, despite having a completely different inner structure.
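To make this slightly more concrete, here is one way to gesture at the formal picture. This is my own paraphrase of the "information relevant far away" idea, not necessarily Wentworth's exact mathematical setup, and the symbols are purely illustrative. Let $X$ be the low-level variables of a local chunk of the world (the nail's particles), let $Y$ be the variables far away, and let $A(X)$ be the abstraction (shape, color, pointiness). Then $A$ should keep exactly the information in $X$ that matters for $Y$:

$$I\big(A(X);\,Y\big) = I\big(X;\,Y\big), \qquad \text{with } A(X) \text{ as small as possible,}$$

where $I$ denotes mutual information. Everything in $X$ that $A(X)$ throws away, such as the exact position of each atom, is indistinguishable from noise when viewed from far away.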
If WAH is true, this would be bad for alignment
I think that if the Wentworthian abstraction hypothesis is true, plans like the proposed solution to the diamond-alignment problem or "alignment by default" would fail. Here is why:
Let's take the diamond case first:
Suppose the NAH and the WAH are both true, and we have adequate interpretability tools.
You train an AI in an environment that contains diamonds. Due to natural abstractions, it forms a diamond abstraction. You use your interpretability tools to see that it has a diamond abstraction. You give the AI reward when it is around diamonds or acquires diamonds, and it forms a diamond-shard (it learns to value diamonds terminally). You use your interpretability tools to verify that it does actually value diamonds. You let it loose in the world, and it creates lots of diamonds.
If the WAH is true, it will recognize and identify diamonds by the information about diamonds that is preserved over distance in a noisy environment: their hardness, their shininess, their density, and so on. It will value that abstraction, so it will steer the world towards containing things that look, from afar, hard, shiny, and dense. Importantly, the WAH predicts that the molecular pattern of diamond (each carbon atom covalently bonded to four other carbon atoms) will not be part of the diamond abstraction. It is not information that is itself relevant over distance. Sure, the atomic structure determines the density, hardness, and shininess, but it might not be the only structure with those properties. The AI does not care about the atomic structure; it maximizes its abstraction of diamonds. So whether it will actually produce lots of diamonds depends on whether it finds some cheaper way to produce objects that are shiny, hard, and dense. Since the AI is smart, it will probably find a material with those properties that is easier to produce. So it will tile the universe with "diamonds" and not with diamonds.
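To make the failure mode concrete, here is a toy sketch in Python. Everything in it is hypothetical: the property names, the thresholds, and the two candidate materials are invented for illustration. The only point is that the "diamond" check is built purely out of far-away-relevant properties and never inspects inner structure.

```python
from dataclasses import dataclass

@dataclass
class Material:
    """A candidate object, described by far-away-visible properties plus hidden structure."""
    hardness_mohs: float       # visible "from afar"
    refractive_index: float    # visible "from afar" (shininess)
    density_g_cm3: float       # visible "from afar"
    lattice: str               # inner structure: never enters the abstraction
    production_cost: float     # how expensive it is for the agent to make

def matches_diamond_abstraction(m: Material) -> bool:
    """A 'diamond' abstraction built only from far-away-relevant properties."""
    return (m.hardness_mohs >= 9.5
            and 2.3 <= m.refractive_index <= 2.5
            and 3.4 <= m.density_g_cm3 <= 3.6)

real_diamond = Material(10.0, 2.42, 3.51, lattice="tetrahedral carbon", production_cost=100.0)
cheap_lookalike = Material(9.8, 2.41, 3.50, lattice="something else entirely", production_cost=1.0)

# An agent that maximizes "objects matching the abstraction per unit cost"
# prefers the look-alike, because the lattice never enters the comparison.
best = max([real_diamond, cheap_lookalike],
           key=lambda m: matches_diamond_abstraction(m) / m.production_cost)
print(best.lattice)  # -> "something else entirely"
```

A learned diamond abstraction would of course be messier than an explicit check like this, but the structural point carries over: if the abstraction is built only out of far-away-relevant information, then anything that reproduces that information counts as a diamond to the agent, however it is put together inside.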
Let's see whether the same thing happens in the human-values case. You expose the AI to a lot of data about humans, and it forms an abstraction of human values. You then program the AI in such a way that it behaves in accordance with those values. You use your interpretability techniques to verify that the human values it identifies are plausible (something like "it is good when persons are happy, ..."). The AI is then released and acts in accordance with those values.
The critical piece here is that the human values the AI identifies probably refer to other abstractions (such as persons). If the value is "it is good when persons are happy", then this value is only meaningfully executed if the AI has the proper notion of what "person" means. And what does the AI identify as a person? The WAH has an answer: anything that has the right effects far away. So anything that looks like a person, responds to stimuli like a person, and talks like a person. The WAH explicitly predicts that the inner computations that are typical of persons are not, in themselves, part of the AI's person abstraction.
So the AI fills the universe with trillions of happy persons, living fulfilling lives full of Truth and Beauty. But if it ever finds something that looks like a person from the outside (and speaks like a person, ...), is cheaper to produce, and has a completely different inner structure (for example, much simpler non-conscious GPT-style simulations of humans), it will happily tile the universe with those "persons".
The WAH predicts that most minds will only form abstractions according to outward appearance, not according to inner structure. So whenever we value a concept because of its inner structure, the WAH predicts that AIs that want to maximize that abstraction will be naturally misaligned: they will throw the inner structure under the bus to better maximize the outward appearance.
How can we fix this?
I see two ways in which it could be possible to build AIs that do care about inner structure:
1. Wentworthian abstractions are not the correct framework for natural abstractions. It could be that most minds actually don't form concepts purely by outer appearance. It might be that I misunderstood John Wentworth's ideas, or that his approach is wrong. What makes this possibility plausible is that humans seem to care about inner structure (for example, persons vs. non-sentient "persons").
2. Mind designs where caring about inner structure is hard-coded in. An example of this could be infra-Bayesian agents: in infra-Bayesianism, the argument of the utility function is which computations are running in the universe. That is a function of inner structure only, not of outward appearance. Unfortunately, we do not know how to actually build an infra-Bayesian agent.
Comments

Good post; in particular, good job distinguishing between the natural abstraction hypothesis and my specific mathematical operationalization of it.
The outer appearance vs inner structure thing doesn’t quite work the way it initially seems, for two reasons. First, long-range correlations between the “insides” of systems can propagate through time. Second, we can have concepts for things we haven’t directly observed or can’t directly observe.
To illustrate both of these simultaneously, consider the consensus DNA sequence of some common species of tree. It's a feature "internal" to the trees; it's mostly not outwardly visible. And biologists were aware that the sequence existed, and had a concept for it, well before they were able to figure out the full sequence. So how does this fit with natural abstractions as "information relevant far away"? Well, because there are many trees of that species which all have roughly the same DNA sequence, and those trees are macroscopically far apart in the world. (And even at a smaller scale, there are many copies of the DNA sequence within different cells of a single tree, and those can also be considered "far apart". And going even narrower, if there were a single strand of DNA, its sequence might still be a natural abstraction insofar as it persists over a long time.)
Causally speaking, how is information about the DNA sequence able to propagate from the "insides" of one tree to the "insides" of another, even when it mostly isn't "outwardly" visible? Well, in graphical terms, it propagated through time, through a chain of ancestor trees which ultimately connects all the current trees with roughly the same sequence.
In my view, you did indeed misunderstand JW's ideas. His expression "relevant far away"/"distance" is not limited to spatial or even spatiotemporal distance. It's a general notion of distance which is not fully formalized (the work's not done yet).
We do indeed have concerns about inner properties (like your examples), and it's something JW is fully aware of. So (relevant) inner structures could be framed as relevant "far away" with the right formulation.
Upvoted, nice original thoughts in this post IMO!
A shot at the diamond-alignment problem does indeed rely on the formation of a diamond abstraction in the policy network. Two points:
1. I was purposefully not tackling the diamond maximization problem, and briefly noted that[1] in an early paragraph of the post (but probably I should state it more clearly). I think "make a crisp maximizer for a simple-seeming quantity" is another fatal assumption made by "classic" theory.

Very few people seem to note that there's an important difference. In his critique, Nate Soares seemed to write that I was trying to solve diamond maximization. I have reminded people that the post was not about diamond maximization about five times this week alone. I should probably update the story.
2. My story doesn't involve the designers surgically modifying the network to care about the diamond abstraction, which is what I read you as implying here. (Are you? EDIT: after reading further, I think you didn't mean that. Maybe worth clarifying the quoted portion?)
Nitpick: I agree with the distinction between NAH and WAH, but I think this wouldn’t constitute an example of NAH true, WAH false. Wait, before I elaborate further—was the second sentence supposed to contain such an example?
[1] For more intuitions for why I think pure x-maximization is anti-natural, see this subsection of Inner and outer alignment decompose one hard problem into two extremely hard problems.
Thanks a lot for the comment and correction :)
I updated “diamond maximization problem” to “diamond alignment problem”.
I didn't understand your proposal to involve surgically inserting a "diamonds are good" drive, but rather systematically rewarding the agent for acquiring diamonds so that a diamond shard forms organically. I also edited that sentence.
I am not sure I get your nitpick: "Just as you can deny that Newtonian mechanics is true, without denying that heavy objects attract each other." was supposed to be an example of "the specific theory is wrong, but the general phenomenon it tries to describe exists", in the same way that I think natural abstractions exist but my (possibly flawed) understanding of Wentworth's theory of natural abstractions is wrong. It was not supposed to be an example of a natural abstraction itself.
It's not clear to me that a different internal structure with no observable external differences is relevant. Something that has all the external behaviour/interactions of diamond (functionally identical) but a different internal composition is, for all practical purposes, diamond.
I think there's a missing hypothesis here (it probably seemed obvious to you, but it is not at all obvious to me).
The missing hypothesis is something like: two systems can be identical in all their externally observable behaviour and still differ in a way we care about (for example, one is conscious and the other is not).
I reject that hypothesis. Accepting that hypothesis is accepting zombies.