I’m afraid it is generally infeasible to avoid modelling humans at least implicitly. One reason is that basically any practical ontology we use is implicitly human. In a sense, the only implicitly non-human knowledge is quantum field theory (and even that is not clear).
For example: while human-independent methods for measuring negative side effects sound human-independent, it seems to me that a lot of ideas about humans creep into the details. The proposals I’ve seen generally depend on some coarse-graining of states: you at least want to somehow remove time from the state, but in practice the coarse-graining is based on …actually, what humans value. (If this research agenda were really trying to avoid implicit human models, I would expect people to spend a lot of effort on measures of quantum entanglement, decoherence, and similar topics.)
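To make that concrete, here is a minimal toy sketch (all names and the toy state are hypothetical, not any specific proposal from the literature): the side-effect penalty itself is trivial, and everything that matters is packed into the human-chosen coarse-graining function.

```python
# Toy sketch: a side-effect "penalty" whose verdict is entirely determined by
# a human-chosen coarse-graining of the state. All names are hypothetical.
from typing import Callable, Hashable

State = dict  # toy full state: objects we care about plus "irrelevant" details

def side_effect_penalty(baseline: State, outcome: State,
                        coarse_grain: Callable[[State], Hashable]) -> int:
    """Penalty is 0 if the two states coarse-grain to the same abstract state, else 1."""
    return 0 if coarse_grain(baseline) == coarse_grain(outcome) else 1

before = {"vase_intact": True,  "dust_particles": 1_000_000}
after  = {"vase_intact": False, "dust_particles": 1_000_003}

# Two equally "state-based" coarse-grainings, encoding opposite value judgements:
human_ish   = lambda s: (s["vase_intact"],)                # dust doesn't count
physics_ish = lambda s: (s["dust_particles"] // 10**6,)    # vases don't count

print(side_effect_penalty(before, after, human_ish))    # 1: the broken vase registers
print(side_effect_penalty(before, after, physics_ish))  # 0: "nothing changed"
```

The comparison never mentions humans, but the choice between `human_ish` and `physics_ish` is exactly where “what humans value” enters.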
The goal is to avoid particular hazards, rather than to make things human-independent as an end in itself. So if we accidentally use a concept of “human-independent” that yields impractical results like “the only safe concepts are those of fundamental physics”, we should just conclude that we were using the wrong conception of “human-independent”. A good way to avoid this is to keep revisiting the concrete reasons we started down this path in the first place, and see which conceptions capture our pragmatic goals well.
Here are some examples of concrete outcomes that various AGI alignment approaches might want to see, if they’re intended to respond to concerns about human models:
The system never exhibits thoughts like “what kind of agent built me?”
The system exhibits thoughts like that, but never arrives at human-specific conclusions like “my designer probably has a very small working memory” or “my designer is probably vulnerable to the clustering illusion”.
The system never reasons about powerful optimization processes in general. (In addition to steering a wide berth around human models, this might be helpful for guarding against AGI systems doing some varieties of undesirable self-modification or building undesirable smart successors.)
The system only allocates cognitive resources to solving problems in a specific domain like “biochemistry” or “electrical engineering”.
Different alignment approaches can target different subsets of those goals, and of many other similar goals, depending on what they think is feasible and important for safety.
As I see it, a big part of the problem is that there is an inherent tension between “concrete outcomes avoiding general concerns with human models” and “how systems interacting with humans must work”. I would expect that the more you want to avoid general concerns with human models, the more “impractical” the suggestions become. In other words, the tension between the “Problems with human models” and the “Difficulties without human models” is a tradeoff you cannot avoid just by choosing a different conceptualisation.
I would suggest using grounding in QFT not as an example of an obviously wrong conceptualisation, but as a useful benchmark of “actually human-model-free”. Comparison to that benchmark may then serve as a heuristic pointing to where (at least implicit) human modelling creeps in. In the above-mentioned example of avoiding side effects, the way the “coarse-graining” of the state space is done is exactly a point where Goodharting may happen, and thinking in that direction can maybe even lead to some intuitions about how much information about humans got in.
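As a rough illustration of that heuristic (again only a sketch with hypothetical names, not a worked-out method): compare the coarse-grained measure against a much finer, “closer to physics” distance, and flag the transitions the abstraction silently forgives.

```python
# Sketch of the benchmark heuristic: flag transitions where the coarse-grained
# measure says "no change" but a much finer description says otherwise.
# Such disagreements are crude evidence the coarse-graining is doing value-laden work.

def fine_grained_distance(s1: dict, s2: dict) -> int:
    """Proxy for a human-model-free benchmark: count every differing low-level variable."""
    return sum(s1.get(k) != s2.get(k) for k in set(s1) | set(s2))

def silently_forgiven(transitions, coarse_grain):
    """Return (before, after) pairs whose real changes the abstraction treats as no change."""
    return [(b, a) for b, a in transitions
            if coarse_grain(b) == coarse_grain(a) and fine_grained_distance(b, a) > 0]

# With the earlier toy states, silently_forgiven([(before, after)], physics_ish)
# returns the vase-breaking transition, hinting that this abstraction hides something.
```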
One possible counterargument to the conclusion of the OP is that the main “tuneable” parameters we are dealing with are I. “modelling humans explicitly vs modelling humans implicitly” and II. “total amount of human modelling”. It is then possible that competitive systems exist only in some part of this space, and that by pushing hard on the “total amount of human modelling” parameter we get systems which do less human modelling overall, but when they do it, it happens mostly in implicit, hard-to-understand ways.
That all seems generally fine to me. I agree the tradeoffs are the huge central difficulty here; getting to sufficiently capable AGI sufficiently quickly seems enormously harder if you aren’t willing to cut major corners on safety.