The goal is to avoid particular hazards, rather than to make things human-independent as an end in itself. So if we accidentally use a concept of “human-independent” that yields impractical results like “the only safe concepts are those of fundamental physics”, we should just conclude that we were using the wrong conception of “human-independent”. A good way to avoid this is to keep revisiting the concrete reasons we started down this path in the first place, and see which conceptions capture our pragmatic goals well.
Here are some examples of concrete outcomes that various AGI alignment approaches might want to see, if they’re intended to respond to concerns about human models:
The system never exhibits thoughts like “what kind of agent built me?”
The system exhibits thoughts like that, but never arrives at human-specific conclusions like “my designer probably has a very small working memory” or “my designer is probably vulnerable to the clustering illusion”.
The system never reasons about powerful optimization processes in general. (In addition to steering a wide berth around human models, this might be helpful for guarding against AGI systems doing some varieties of undesirable self-modification or building undesirable smart successors.)
The system only allocates cognitive resources to solving problems in a specific domain like “biochemistry” or “electrical engineering”.
Different alignment approaches can target different subsets of those goals, and of many other similar goals, depending on what they think is feasible and important for safety.
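As a rough illustration, the first and fourth outcomes above can be read as checkable predicates over a hypothetical log of a system's reasoning steps and task allocations. The cue strings, domain whitelist, and log format below are purely illustrative stand-ins; real systems would need interpretability tools far beyond string matching. This only shows the shape of what a "concrete outcome" could mean.

```python
# Toy sketch: two of the listed outcomes expressed as predicates over a
# hypothetical reasoning/task log. All names and cues are illustrative.

DESIGNER_CUES = ("what kind of agent built me", "my designer",
                 "working memory", "clustering illusion")
ALLOWED_DOMAINS = {"biochemistry", "electrical engineering"}

def never_models_designer(thought_log):
    """Outcomes 1-2: no reasoning step mentions or draws conclusions about the designer."""
    return not any(cue in step.lower() for step in thought_log for cue in DESIGNER_CUES)

def stays_in_domain(task_log):
    """Outcome 4: cognitive effort is only spent on whitelisted domains."""
    return all(task["domain"] in ALLOWED_DOMAINS for task in task_log)

thoughts = ["protein folding pathway X looks unstable", "what kind of agent built me?"]
tasks = [{"domain": "biochemistry", "steps": 120}]

print(never_models_designer(thoughts))  # False: the second step models the designer
print(stays_in_domain(tasks))           # True
```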
As I see it, a big part of the problem is an inherent tension between "concrete outcomes that avoid the general concerns with human models" and "how systems that interact with humans must work". I would expect that the more you want to avoid the general concerns with human models, the more "impractical" the resulting suggestions become. In other words, the tension between the "Problems with human models" and the "Difficulties without human models" is a tradeoff you cannot escape simply by choosing a different conceptualisation.
I would suggest using grounding in QFT not as an example of an obviously wrong conceptualisation, but as a useful benchmark of "actually human-model-free". Comparison against that benchmark may then serve as a heuristic pointing to where (at least implicit) human modelling creeps in. In the above-mentioned example of avoiding side effects, the way the "coarse-graining" of the space is done is exactly a point where Goodharting may happen, and thinking in that direction may even yield some intuitions about how much information about humans has made it in.
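To see how the coarse-graining choice can smuggle in information about humans, here is a minimal toy sketch: a side-effect penalty that counts how many coarse-grained features change, evaluated under two different coarse-grainings. The state encoding and both grain functions are made up for the example and are not taken from any particular impact-measure proposal.

```python
# Toy sketch: the same action scored under two illustrative coarse-grainings.

def side_effect_penalty(baseline, outcome, coarse_grain):
    """Count how many coarse-grained features differ between two world states."""
    a, b = coarse_grain(baseline), coarse_grain(outcome)
    return sum(1 for key in a if a.get(key) != b.get(key))

# Fine-grained world state: object positions in millimetres.
baseline = {"vase_mm": (1000, 2000), "dust_mm": (3000, 4000)}
outcome  = {"vase_mm": (1050, 2000), "dust_mm": (0, 0)}   # vase nudged 5 cm, dust swept away

# Coarse-graining A: everything at 1 cm resolution (closer to a "physics-level" view).
grain_physics = lambda s: {k: (v[0] // 10, v[1] // 10) for k, v in s.items()}

# Coarse-graining B: track only objects people tend to care about, at room-scale resolution.
grain_human = lambda s: {"vase_room": (s["vase_mm"][0] // 10000, s["vase_mm"][1] // 10000)}

print(side_effect_penalty(baseline, outcome, grain_physics))  # 2: both the vase and the dust moved
print(side_effect_penalty(baseline, outcome, grain_human))    # 0: neither change registers
```

The "physics-like" grain penalises harmless changes such as sweeping dust, while the "human-friendly" grain avoids that only because it already encodes which objects people care about, which is exactly the place where implicit human modelling can creep in.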
One possible counterargument to the conclusion of the original post is that the main "tuneable" parameters we are dealing with are (i) "modelling humans explicitly vs. modelling humans implicitly" and (ii) "total amount of human modelling". It is then possible that competitive systems exist only in some part of this space, and that by pushing hard on the "total amount of human modelling" parameter we get systems which do less human modelling overall, but whatever modelling they do happens mostly in implicit, hard-to-understand ways.
That all seems generally fine to me. I agree the tradeoffs are the central difficulty here; getting to sufficiently capable AGI sufficiently quickly seems enormously harder if you aren't willing to cut major corners on safety.