Abstractions are not Natural

(This was inspired by a conversation with Alex Altair and other fellows as part of the agent foundations fellowship, funded by the LTFF)

(Also: after I had essentially finished this piece, I was pointed toward the post Natural abstractions are observer-dependent which covers a lot of similar ground. I’ve decided to post this one anyway because it comes at things from a slightly different angle.)

Here is a nice summary statement of the Natural Abstractions Hypothesis (NAH):

The Natural Abstraction Hypothesis, proposed by John Wentworth, states that there exist abstractions (relatively low-dimensional summaries which capture information relevant for prediction) which are “natural” in the sense that we should expect a wide variety of cognitive systems to converge on using them.

I think that this is not true and that whenever cognitive systems converge on using the same abstractions this is almost entirely due to similarities present in the systems themselves, rather than any fact about the world being ‘naturally abstractable’.

I tried to explain my view in a conversation and didn’t do a very good job, so this is a second attempt.

To start, I’ll attempt to answer the following question:

Suppose we had two cognitive systems which did not share the same abstractions. Under what circumstances would we consider this a refutation of the NAH?

Systems must have similar observational apparatus

Imagine two cognitive systems observing the same view of the world with the following distinction: the first system receives its observations through a colour camera and the second system receives its observations through an otherwise identical black-and-white camera. Suppose the two systems have identical goals and we allow them both the same freedom to explore and interact with the world. After letting them do this for a while we quiz them both about their models of the world (either by asking them directly or through some interpretability techniques if the systems are neural nets). If we found that the first system had an abstraction for ‘blue’ but the second system did not, would we consider this evidence against the NAH? Or more specifically, would we consider this evidence that ‘blue’ or ‘colour’ were not ‘natural’ abstractions? Probably not, since it is obvious that the lack of a ‘blue’ abstraction in the second system comes from its observational apparatus, not from any feature of ‘abstractions’ or ‘the world’.

More generally, I suspect that for any abstraction formed by a system making observations of a world, one could create a system which fails to form that abstraction when observing the same world, simply by giving it a different observational apparatus. If you are not convinced, look around you and think about how many ‘natural’-seeming abstractions you would have failed to develop if you had been born with senses of significantly lower resolution than the ones you have. To take an extreme example, a blind, deaf person who only had a sense of smell would presumably form different abstractions from a sighted person who could not smell, even if they are in the same underlying ‘world’.

Examples like these, however, would (I suspect) not be taken as counterexamples to the NAH by most people. Implicit in the NAH is that for two systems to converge on abstractions, they must be making observations using apparatus that is at least roughly similar, in the sense that it must allow approximately the same information to be transmitted from the environment to the system. (I’m going to use the word ‘similar’ loosely throughout this post.)

Systems must be interacting with similar worlds

A Sentinel Islander probably hasn’t formed the abstraction of ‘laptop computer’, but this isn’t evidence against the NAH. He hasn’t interacted with anything that someone from the developed world would associate with the abstraction ‘laptop computer’, so it’s not surprising if he doesn’t converge on this abstraction. If he moved to the city, got a job as a programmer, used a laptop 10 hours a day and still didn’t have the abstraction of ‘laptop computer’, then it would be evidence against the NAH (or at least: evidence that ‘laptop computer’ is not a natural abstraction).

The NAH is not ‘all systems will form the same abstractions regardless of anything’; it is closer to ‘if two systems are both presented with the same/similar data, then they will form the same abstractions’ [1].

This means that ‘two systems failing to share abstractions’ is not evidence against the NAH unless both systems are interacting with similar environments.

Systems must be subject to similar selection pressure/​constraints

Abstractions often come about through selection pressure and constraints. In particular, they are in some sense efficient ways of describing the world, so they are useful when computational power is limited. If a system can perfectly model every detail of the world without abstracting, then it can get away without using abstractions. Part of the reason that humans use abstractions is that our brains are finite in size and use finite amounts of energy.

To avoid the problems discussed in the previous sections, let’s restrict ourselves to situations where systems are interacting with the same environment, through the same observational apparatus.

Here’s a toy environment:

Imagine a 6-by-6 pixel environment where pixels can be either white or black. An agent can observe the full 36 pixels at once. It can move a cursor (represented by a red dot) up, down, left, or right one square at a time, and it knows where the cursor is. Apart from moving the cursor, it can take a single action which flips the colour of the square that the cursor is currently on (from white to black or from black to white). An agent will be left to its own devices for 1 million timesteps and then will be given a score equal to the number of black pixels in the top half of the environment plus the number of white pixels in the bottom half. After that, the environment is reset to a random configuration with the cursor in the top left corner and the game is played again.

A perfect-scoring end result.
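For concreteness, here is a minimal sketch of this environment in Python. The class name, the dictionary grid representation, and the method names are illustrative choices of mine, not from any existing implementation:

import random

class PixelGame:
    """A rough sketch of the 6-by-6 flip-the-pixels game described above."""
    SIZE = 6

    def __init__(self):
        # random starting colours; the cursor starts in the top-left corner, (1, 1)
        self.grid = {(row, col): random.choice(["white", "black"])
                     for row in range(1, self.SIZE + 1)
                     for col in range(1, self.SIZE + 1)}
        self.cursor = (1, 1)  # (row, col), rows and columns numbered 1 to 6

    def move(self, direction):
        # move the cursor one square, staying on the board
        d_row, d_col = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}[direction]
        row, col = self.cursor
        self.cursor = (min(max(row + d_row, 1), self.SIZE), min(max(col + d_col, 1), self.SIZE))

    def flip_colour(self):
        # flip the colour of the square the cursor is currently on
        self.grid[self.cursor] = "black" if self.grid[self.cursor] == "white" else "white"

    def score(self):
        # black pixels in the top half (rows 1-3) plus white pixels in the bottom half (rows 4-6)
        return sum(1 for (row, _), colour in self.grid.items()
                   if (row <= 3 and colour == "black") or (row >= 4 and colour == "white"))

An episode is then a million calls to move/flip_colour followed by a single call to score.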

It would be fairly straightforward to train a neural net to produce a program which would navigate this game/​environment and get a perfect score. Here is one way in which a program might learn to implement a perfect strategy (apologies for bad pseudocode):

# a ‘lookup table’ strategy: one branch for every (cursor position, pixel colour) combination
while True:
    if cursor_position == (1,1) and pixel(1,1) == white:
        flip_colour
        move_nextpixel
    elif cursor_position == (1,1) and pixel(1,1) == black:
        move_nextpixel
    elif cursor_position == (1,2) and pixel(1,2) == white:
        flip_colour
        move_nextpixel
    ...

… and so on for all 36 pixels. In words: this program has explicitly coded which action it should take in every possible situation (a strategy sometimes known as a ‘lookup table’). There’s nothing wrong with this strategy, except that it takes a lot of memory. If we trained a neural net using a process which selected for short programs we might end up with a program that looks like this:

# the same strategy, but with pixels grouped into two abstract categories
while True:
    if cursor_position in top_half and pixel(cursor_position) == white:
        flip_colour
        move_nextpixel
    elif cursor_position in top_half and pixel(cursor_position) == black:
        move_nextpixel

    elif cursor_position in bottom_half and pixel(cursor_position) == black:
        flip_colour
        move_nextpixel
    elif cursor_position in bottom_half and pixel(cursor_position) == white:
        move_nextpixel

In this code, the program is using an abstraction. Instead of enumerating all possible pixels, it ‘abstracts’ them into two categories, ‘top_half’ (where it needs pixels to be black) and ‘bottom_half’ (where it needs pixels to be white), which keep the useful information about the pixels while discarding ‘low level’ information about the exact coordinates of each pixel. (The code defining the top_half and bottom_half abstractions is omitted from the pseudocode; a rough sketch of what it could look like is given below.)
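One way to define these two categories, assuming cursor positions are represented as (row, col) pairs with rows and columns numbered 1 to 6 (just a sketch to make the idea concrete):

top_half    = {(row, col) for row in range(1, 4) for col in range(1, 7)}  # rows 1-3
bottom_half = {(row, col) for row in range(4, 7) for col in range(1, 7)}  # rows 4-6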

Now imagine we trained two ML systems on this game, one with no selection pressure to produce short programs and the other where long programs were heavily penalised. The system with no selection pressure produces a program similar to the first piece of pseudocode, where each pixel is treated individually, and the system subject to selection pressure produces a program similar to the second piece of pseudocode which makes use of the top_half/​bottom_half abstraction to shorten the code. The first system fails to converge on the same abstractions as the second system.
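Schematically, the only difference between the two training setups is whether the objective includes a penalty on program length (the names and the penalty weight lambda here are purely illustrative):

objective_without_pressure(program) = expected_score(program)
objective_with_pressure(program)    = expected_score(program) - lambda * length(program)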

If we observed this, would it constitute a counterexample to the NAH? I suspect defenders of the NAH would reply ‘No, the difference between the two systems comes from the difference in selection pressure. If the two systems were subject to similar selection pressures, then they would converge on the same abstractions’.

Note that ‘selection pressure’ also includes physical constraints on the systems, such as constraints on size, memory, processing speed and so on, all of which can change the abstractions available to a system. A tarantula probably doesn’t have a good abstraction for ‘Bulgaria’. Even if we showed it Bulgarian history lectures and, in an unprecedented landslide victory, it was elected president of Bulgaria, it still wouldn’t have a good abstraction for ‘Bulgaria’ (or, if it did, it probably would not be the same abstraction as humans have).

This would presumably not constitute a refutation of the NAH - it’s just the case that a spider doesn’t have the information-processing hardware to produce such an abstraction.

I’d vote for her (even if she inexplicably has a Serbian Eagle crest in her office)

So cognitive systems must be subject to similar physical constraints and selection pressures in order to converge on similar abstractions. To recap, our current version of the NAH states that two cognitive systems will converge on the same abstractions if

  • they have similar observational apparatus...

  • … and are interacting with similar environments...

  • ...and are subject to similar selection pressures and physical constraints.

Systems must have similar utility functions [2]

One reason to form certain abstractions is because they are useful for achieving one’s goals/​maximizing one’s utility function. Mice presumably have an abstraction for ‘the kind of vibrations a cat makes when it is approaching me from behind’ because such an abstraction is useful for fulfilling their (sub)goal of ‘not being eaten by a cat’. Honeybees have an abstraction of ‘the dance that another bee does to indicate that there are nectar-bearing flowers 40m away’ because it is useful for their (sub)goal of ‘collecting nectar’.

Humans do not naturally converge on these abstractions because they are not useful to us. Does this provide evidence against the NAH? I’m only guessing, but NAH advocates might say something like: ‘but humans can learn both of these abstractions quite easily; just by hearing a description of the honeybee dance, humans can acquire the abstraction that the bees have. This is actually evidence in favour of the NAH: a human can easily converge on the same abstraction as a honeybee, which is a completely different type of cognitive system’.

My response to this would be that I will only converge on the same abstractions as honeybees if my utility function explicitly has a term which values ‘understanding honeybee abstractions’. And even for humans who place great value on understanding honeybee abstractions, it still took years of study to understand honeybee dances.

Is the NAH saying ‘two cognitive systems will converge on the same set of abstractions, provided that one of the systems explicitly values and works towards understanding the abstractions the other system is using’? If so, I wish people would stop saying that the NAH says that systems will ‘naturally converge’ or using other suggestive language which implies that systems will share abstractions by default without explicitly aiming to converge.

The honeybee example is a bit tricky because, on top of having different utility functions, honeybees also have different observational apparatus, and are subject to different selection pressures and constraints on their processing power. To get rid of these confounders, here is an example where the systems are the same in every respect except for their utility functions, and this leads them to develop different abstractions.

Consider the environment of the pixel game from the previous section. Suppose there are two systems playing this game, which are identical in all respects except their utility functions. They observe the same world in the same way, and they are both subject to the same constraints, including the constraint that their program for navigating the world must be ‘small’ (i.e. it cannot just be a lookup table for every possible situation). But they have different utility functions. System A uses a utility function U_A, which gives one point for every pixel in the top half of the environment which ends up black and one point for every pixel in the bottom half of the environment which ends up white.

A perfect score for U_A

System B, on the other hand, uses a utility function U_B, which gives one point for every pixel on the outer rim of the environment which ends up white and one point for every pixel in the inner 4x4 square which ends up black.

A perfect score for U_B
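To make the two objectives concrete, here is a minimal sketch of the two utility functions, assuming the same dictionary grid representation as in the environment sketch earlier (the function names and helper are mine, chosen only for illustration):

def U_A(grid):
    # one point per black pixel in the top half (rows 1-3)
    # plus one point per white pixel in the bottom half (rows 4-6)
    return sum(1 for (row, _), colour in grid.items()
               if (row <= 3 and colour == "black") or (row >= 4 and colour == "white"))

def U_B(grid):
    # one point per white pixel on the outer rim
    # plus one point per black pixel in the inner 4x4 square
    def on_rim(row, col):
        return row in (1, 6) or col in (1, 6)
    return sum(1 for (row, col), colour in grid.items()
               if (on_rim(row, col) and colour == "white")
               or (not on_rim(row, col) and colour == "black"))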

The two systems are then allowed to explore the environment and learn strategies which maximize their utility functions while subject to the same constraints. I suspect that the system using U_A would end up with abstractions for the ‘top half’ and ‘bottom half’ of the environment, while the system using U_B would not end up with these abstractions, because they are not useful for achieving U_B. On the other hand, system B would end up with abstractions corresponding to ‘middle square’ and ‘outer rim’, because these abstractions are useful for achieving U_B.

So two systems require similar utility functions in order to converge on similar abstractions.

(Side note: utility functions are similar to selection pressure in that we can describe ‘having a utility function U’ as a selection pressure ‘selecting for strategies/​cognitive representations which result in U being maximized’. I think that utility functions merit their own section for clarity but I wouldn’t be mad if someone wanted to bundle them up in the previous section.)

Moving forward with the Natural Abstraction Hypothesis

If we don’t want the NAH to be refuted by the counterexamples in the sections above, something has to give. I will tentatively suggest some (not mutually exclusive) options for moving forward with the NAH.

Option 1

First, we could add lots of caveats and change the statement of the NAH. Recall the informal statement used at the start of this post:

there exist abstractions … which are “natural” in the sense that we should expect a wide variety of cognitive systems to converge on using them

We would have to modify this to something like:

there exist abstractions … which are “natural” in the sense that we should expect a wide variety of cognitive systems to converge on using them provided that those cognitive systems:

  • have similar observational apparatus,

  • and are interacting with similar environments,

  • and are subject to similar physical constraints and selection pressures,

  • and have similar utility functions.

This is fine, I guess, but it seems to me that we’re stretching the use of the phrase ‘wide variety of cognitive systems’ if we then put all of these constraints on the kinds of systems to which our statement applies.

This statement of the NAH is also dangerously close to what I would call the ‘Trivial Abstractions Hypothesis’ (TAH):

there exist abstractions … which are “natural” in the sense that we should expect a wide variety of cognitive systems to converge on using them provided that those cognitive systems:

  • are exactly the same in every respect.

which is not a very interesting hypothesis!

Option 2

The other way to salvage the NAH is to home in on the phrase ‘there exist abstractions’. One could claim that none of the abstractions in the counterexamples I gave in this post are ‘true’ natural abstractions; the hypothesis is just that some such abstractions exist. If someone takes this view, I would be interested to know: what are these abstractions? Are there any such abstractions which survive changes in observational apparatus/selection pressure/utility function?

If I were in a combative mood [3], I would say something like: show me any abstraction and I will create a cognitive system which does not hold this abstraction, by tweaking one of the conditions (observational apparatus/utility function, etc.) I described above.

Option 3

Alternatively, one could properly quantify what ‘wide variety’ and ‘similar’ actually mean, in a way which gels with the original spirit of the NAH. When I suggested that different utility functions lead to different abstractions, Alex made the following suggestion for reconciling this with the NAH (not an exact quote or fully fleshed-out). He suggested that for two systems, you could make small changes to their utility functions and this would only change the abstractions the systems adopted minimally, but changing the abstractions in a large way would require the utility functions to be exponentially different. Of course, we would need to be specific about how we quantify the similarity of two sets of abstractions and the similarity of two utility functions, but this provides a sense in which a set of abstractions could be ‘natural’ while still allowing different systems to have different abstractions. One can always come up with a super weird utility function that induces strange abstractions in the systems which hold it, but this would in some sense be contrived and complex and ‘unnatural’. Something similar could also be done in terms of quantifying the other ways in which systems need to be ‘similar’ in order to share the same abstractions.
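To illustrate the shape such a claim might take (this is my own informal notation, not anything Alex or the NAH literature has committed to): write A(S) for the set of abstractions a system S converges on, d_abs for some distance between sets of abstractions, and d_U for some distance between utility functions. Holding observational apparatus, environment and other constraints fixed, a quantified NAH might then assert something like

    d_abs(A(S_1), A(S_2)) <= f(d_U(U_1, U_2))

for some slowly-growing function f, so that only wildly different utility functions can push two systems onto wildly different abstractions.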

Has anyone done any work like this? In my initial foray into the natural abstractions literature I haven’t seen anything similar to this but maybe it’s there and I missed it. It seems promising!

Option 4

This option is to claim that the universe abstracts well, in a way that suggests some ‘objective’ set of abstractions, but to hold back on the claim that most systems will converge on this objective set of abstractions. As far as I can tell, a lot of the technical work on Natural Abstractions (such as the work involving the Koopman-Pitman-Darmois theorem) falls under this category.

There might be some objective sense in which general relativity is the correct way of thinking about gravity; that doesn’t mean all intelligent agents will converge on believing it. For example, an agent might just not care about understanding gravity, or be born before the prerequisite math was developed, or be born with a brain that struggles with the math required, or only be interested in pursuing goals which can be modelled entirely using Newtonian mechanics. Similarly, there might be an ‘objective’ set of abstractions present in the natural world, but this doesn’t automatically mean that all cognitive systems will converge on using this set of abstractions.

If you wanted to go further and prove that a wide variety of systems will use this set of abstractions, you would then need to do an additional piece of work showing that the objective set of abstractions is also the ‘most useful’ in some operational sense.

As an example of work in this direction, one could try proving that, unless a utility function is framed in terms of this objective/natural set of abstractions, it is not possible to maximize it (or at least: not possible to maximize it any better than some ‘random’ strategy would). Under this view, it would still be possible to create an agent with a utility function which cared about ‘unnatural’ abstractions, but this agent would not be successful at achieving its goals. We could then prove something like a selection theorem along the lines of ‘if an agent is actually successful in achieving its goals, it must be using a certain set of abstractions and we must be able to frame those goals in terms of this set of abstractions’.

This sounds interesting to me; I would love to hear if any work has already been done along these lines!

  1. ^

    A similar example, of someone who has never seen snow, is given in section 1c here.

  2. ^

    I’m going to use ‘utility function’ in a very loose sense here, interchangeably with ‘goal’/‘reward function’/‘objective’. I don’t think this matters for the meat of the argument.

  3. ^

    Don’t worry, I’m not!