As I see it, a big part of the problem is that there is an inherent tension between “concrete outcomes avoiding general concerns with human models” and “how systems interacting with humans must work”. I would expect that the more you want to avoid general concerns with human models, the more “impractical” the suggestions get; in other words, the tension between the “Problems with human models” and “Difficulties without human models” sections is a tradeoff that no choice of conceptualisation will avoid.
I would suggest using grounding in QFT not as an example of an obviously wrong conceptualisation, but as a useful benchmark of “actually human-model-free”. Comparison to that benchmark can then serve as a heuristic for spotting where (at least implicit) human modelling creeps in. In the above-mentioned example of avoiding side effects, the way the “coarse-graining” of the state space is done is precisely a point where Goodharting may happen, and thinking in that direction may even yield some intuitions about how much information about humans got in.
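To make the coarse-graining point concrete, here is a minimal toy sketch (my own illustration, not anything from the post; the function and variable names are hypothetical). It computes a crude side-effect penalty by comparing which abstract states remain reachable after acting versus after doing nothing, and the free parameter `abstract` is exactly the place where information about what humans care about can enter without ever being labelled as a human model.

```python
# Hypothetical sketch, not from the original post: a toy side-effect penalty
# where the choice of the coarse-graining function `abstract` is the point
# at which information about human preferences can creep in unnoticed.

from typing import Callable, Hashable, Iterable

State = Hashable
AbstractState = Hashable


def side_effect_penalty(
    states_after_action: Iterable[State],
    states_after_noop: Iterable[State],
    abstract: Callable[[State], AbstractState],
) -> int:
    """Count abstract states reachable under the no-op baseline but lost by acting.

    `abstract` decides which low-level differences "count" as side effects.
    A physics-level (QFT-like) abstraction treats every microstate change as a
    side effect; any usable abstraction instead groups states by features that
    humans care about, which is implicit human modelling.
    """
    reachable_noop = {abstract(s) for s in states_after_noop}
    reachable_action = {abstract(s) for s in states_after_action}
    return len(reachable_noop - reachable_action)
```

Swapping `abstract` from the identity function (every microstate difference counts, i.e. the QFT-like benchmark) to one that groups states by features like “vase intact vs. broken” changes the penalty entirely; that switch is the implicit human modelling which comparison to the benchmark is meant to expose.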
One possible counterargument to the conclusion of the OP is that the main “tuneable” parameters we are dealing with are I. “modelling humans explicitly vs. modelling humans implicitly” and II. “total amount of human modelling”. It is then possible that competitive systems exist only in some part of this space, and that by pushing hard on the “total amount of human modelling” parameter we get systems which do less human modelling overall, but which, when they do model humans, do so mostly in implicit, hard-to-understand ways.
That all seems generally fine to me. I agree these tradeoffs are the central difficulty here; getting to sufficiently capable AGI sufficiently quickly seems enormously harder if you aren’t willing to cut major corners on safety.