Short answer: some goals incentivize general intelligence, which incentivizes tracking lots of abstractions and also includes the ability to pick up and use basically-any natural abstractions in the environment at run-time.
Longer answer: one qualitative idea from the Gooder Regulator Theorem is that, for some goals in some environments, the agent won’t find out until later what its proximate goals are. As a somewhat-toy example: imagine playing a board game or video game in which you don’t find out the win conditions until relatively late into the game. There’s still a lot of useful stuff to do earlier on—instrumental convergence means that e.g. accumulating resources and gathering information and building general-purpose tools are all likely to be useful for whatever the win condition turns out to be.
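To make that toy example concrete (this is my own sketch, with invented goals and payoffs, not something from the quoted post): while the win condition is still unknown, a generic resource-gathering action has higher expected value than any goal-specific action, even though each specific action would win outright if its goal were the one revealed.

```python
# Toy version of a game where the win condition is revealed only at the end
# (goals, actions, and payoffs here are invented purely for illustration).

GOALS = ["A", "B", "C"]

def payoff(action: str, goal: str) -> float:
    """An action specialized for one goal is worth 2 toward that goal and 0
    toward the others; a generic 'accumulate resources / build tools' action
    is worth 1 toward whichever goal turns out to matter."""
    if action == "generic":
        return 1.0
    return 2.0 if action == f"specific_{goal}" else 0.0

def expected_value(action: str) -> float:
    # The goal is drawn uniformly at random only after the action is chosen.
    return sum(payoff(action, g) for g in GOALS) / len(GOALS)

for action in ["generic", "specific_A", "specific_B", "specific_C"]:
    print(f"{action}: {expected_value(action):.2f}")
# generic: 1.00      <- highest expected value while the goal is still unknown,
# specific_A: 0.67      even though each specialized action would dominate
# specific_B: 0.67      once the actual goal were known
# specific_C: 0.67
```

That is the instrumental-convergence point in miniature: generic resources pay off under (almost) any goal the environment might eventually reveal.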
As I understand this argument, even if an agent’s abstractions depend on its goals, it doesn’t matter because disparate agents will develop similar instrumental goals due to instrumental convergence. Those goals involve understanding and manipulating the world, and thus require natural abstractions. (And there’s the further claim that a general intelligence can in fact pick up any needed natural abstraction as required.)
That covers instrumental goals, but what about final goals? These can be arbitrary, per the orthogonality thesis. Even if an agent develops a set of natural abstractions for instrumental purposes, if it has non-natural final goals, it will need to develop a supplementary set of non-natural goal-dependent abstractions to describe them as well.
When it comes to an AI modeling human abstractions, it does seem plausible to me that humans’ lowest-level final goals/values can be described entirely in terms of natural abstractions, because they were produced by natural selection and so had to support survival & reproduction. It’s a bit less obvious to me this still applies to high-level cultural values (would anyone besides a religious Jew naturally develop the abstraction of kosher animal?). In any case, if it’s sufficiently important for the AI to model human behavior, it will develop these abstractions for instrumental purposes.
Going the other direction, can humans understand, in terms of our abstractions, those that an AI develops to fulfill its final goals? I think not necessarily, or at least not easily. An unaligned or deceptively aligned mesa-optimizer could have an arbitrary mesa-objective, with no compact description in terms of human abstractions. This matters if the plan is to retarget an AI’s internal search process. Identifying the original search target seems like a relevant intermediate step. How else can you determine what to overwrite, and that you won’t break things when you do it?
I claim that humans have that sort of “general intelligence”. One implication is that, while there are many natural abstractions which we don’t currently track (because the world is big, and I can’t track every single object in it), there basically aren’t any natural abstractions which we can’t pick up on the fly if we need to. Even if an AI develops a goal involving molecular squiggles, I can still probably understand that abstraction just fine once I pay attention to it.
This conflates two different claims:
1. A general intelligence trying to understand the world can develop any natural abstraction as needed. That is, regularities in observations / sensory data → abstraction / mental representation.
2. A general intelligence trying to understand another agent's abstraction can model its implications for the world as needed. That is, abstraction → predicted observational regularities.
The second doesn’t follow from the first. In general, if a new abstraction isn’t formulated in terms of lower-level abstractions you already possess, integrating it into your world model (i.e. understanding it) is hard. You first need to understand the entire tower of prerequisite lower-level abstractions it relies on, and that might not be feasible for a bounded agent. This is true whether or not all these abstractions are natural.
In the first case, you have some implicit goal that’s guiding your observations and the summary statistics you’re extracting. The fundamental reason the second case can be much harder relates to this post’s topic: the other agent’s implicit goal is unknown, and the space of possible goals is vast. The “ideal gas” toy example misleads here. In that case, there’s exactly one natural abstraction (P, V, T), no useful intermediate abstraction levels, and the individual particles are literally indistinguishable, making any non-natural abstractions incoherent. Virtually any goal routes through that one abstraction.

A realistic general situation may have a huge number of equally valid natural abstractions pertaining to different observables, at many levels of granularity (plus an enormous bestiary of mostly useless non-natural abstractions). A bounded agent learns and employs only the tiny subset of these that helps achieve its goals. Even if all generally intelligent agents share the same potential instrumental goals, and those goals could in principle lead them to learn the same natural abstractions, agents that don’t share the same actual instrumental goals won’t end up learning the same abstractions.
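Returning to the ideal gas for a moment, here is a toy sketch of the indistinguishability point (my own, using a stand-in energy distribution rather than a real kinetic model): a symmetric summary like mean energy per particle survives relabeling the particles, while any statistic keyed to particle identity does not, so the latter can’t latch onto anything real about the gas.

```python
import random
import statistics

# Statistics that ignore particle identity survive relabeling the
# (indistinguishable) particles; identity-dependent ones do not, so they
# can't function as coherent abstractions of the gas at all.

random.seed(0)
N = 10_000
# Stand-in microstate: one scalar "energy" per particle (arbitrary units).
energies = [random.expovariate(1.0) for _ in range(N)]

def temperature_like(es):
    # Symmetric summary: mean energy per particle (the T-like piece of (P, V, T)).
    return statistics.fmean(es)

def identity_dependent(es):
    # A statistic keyed to one particular particle label.
    return es[17]

relabeled = energies[:]
random.shuffle(relabeled)  # physically a no-op: the particles are identical

print(temperature_like(energies), temperature_like(relabeled))
# equal: the symmetric summary is invariant under relabeling
print(identity_dependent(energies), identity_dependent(relabeled))
# generally differs: a label-dependent statistic tracks nothing stable
```

The contrast with a realistic environment is exactly that there are many different relabeling-invariant summaries one could track, and which ones an agent bothers to learn depends on what it is trying to do.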