Although the selection/coherence theorems imply that a mature civilization should act like a wrapper-mind, I don’t think the internals of such a civilization have to look wrapper-mind-like, which has potential relevance for what kind of AGI design we should aim for. A highly efficient civilization composed of human-mind-like AGIs collectively reaching some sort of bargaining equilibrium might ultimately be equivalent to the goal of some wrapper-mind, but directly building the wrapper-mind seems more dangerous: small perturbations in its utility function can destroy all value, whereas perturbations in the initial design of the human-mind-like AGIs could intuitively still lead to a bargaining equilibrium that is valuable from our perspective. This reminds me a bit of how classical physics can be formulated either as locally following Newton’s laws or, equivalently, as finding the action-minimizing path. Here the two formulations are utilitarianism and contractarianism. The benefit of holding both formulations in mind is that axiologies that are simple/robust in one formulation might be highly unnatural in the other.
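For concreteness, the physics analogy is the textbook equivalence between a local rule and a global criterion that pick out the same trajectories (stated here for a single particle in a potential):

```latex
% Local formulation: Newton's second law holds at each instant.
% Global formulation: the trajectory makes the action stationary (the
% "action-minimizing path" of the analogy).
\[
  m\,\ddot{x} = -\nabla V(x)
  \quad\Longleftrightarrow\quad
  \delta \int_{t_0}^{t_1} \left( \tfrac{1}{2}\, m\,\dot{x}^2 - V(x) \right) dt = 0 .
\]
```

The point of the analogy is just that the same set of trajectories (outcomes) admits two very different-looking specifications, and a constraint that is simple to state in one specification can be awkward to state in the other.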
For an agent (wrapper-mind), there is still a sharp distinction between goal and outcome. This is made more confusing by the fact that a plausible shape of a human-aligned goal is a living civilization (long reflection) figuring out what it wants to actually happen; that living civilization looks like an outcome, but it isn’t one. If its decision is that the outcome should look different from a continued computation of this goal, because something better is possible, then the civilization doesn’t actually come into existence, and perhaps won’t incarnate significant moral worth, except as a byproduct of having its decisions computed, which need to become known in order to be implemented. This is different from a wrapper-mind actively demolishing a civilization to build something else, since in this case the civilization never existed (and was never explicitly computed) in the first place. Though it might be impossible to make use of the decisions of a computation without explicitly computing it in a natural sense, forcing any goal-as-computation to become part of the outcome. The goal, unlike the rest of the outcome, is special in not being optimized according to the goal.
My current guess at a long reflection that is both robust (to errors in initial specification) in pursuing/preserving/extrapolating values and efficient to compute is a story simulation, what you arrive at by steelmanning a civilization that exists as a story told by GPT-n. When people are characters in many interacting stories generated by a language model, told in sufficient detail, they can still make decisions: they are no more puppets of the AI than we are puppets of the laws of physics, provided the AI doesn’t specifically intervene at the level of individual decisions. Formulation of values then involves the language model learning from the stories it has written, judging how desirable/instructive they are, and shifting the directions in which new stories get written.
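As a very rough illustration of the generate-judge-shift loop described above (not a proposal for how to actually implement it), here is a minimal sketch in Python; every name in it (`StoryModel`, `generate_story`, `judge_story`, `shift_direction`) is a hypothetical placeholder rather than any real API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a model writes detailed stories, judges how
# desirable/instructive they are, and shifts the direction of future
# stories, without editing individual decisions inside any one story.
# None of these names refer to a real system.

@dataclass
class StoryModel:
    direction: str = "initial value prior / prompt"
    archive: list = field(default_factory=list)

    def generate_story(self) -> str:
        # Placeholder: a real system would sample a long, detailed story
        # conditioned on the current direction.
        return f"a story conditioned on: {self.direction}"

    def judge_story(self, story: str) -> float:
        # Placeholder: score the desirability/instructiveness of the whole
        # story, rather than intervening on characters' individual decisions.
        return 0.0

    def shift_direction(self, story: str, score: float) -> None:
        # Learning from written stories: the judgment feeds back into what
        # kinds of stories get written next.
        self.archive.append((story, score))
        if score > 0:
            self.direction += " | reinforced by a well-judged story"

def long_reflection_step(model: StoryModel) -> None:
    story = model.generate_story()
    score = model.judge_story(story)
    model.shift_direction(story, score)
```

The key structural feature is that feedback acts only on the distribution of future stories, not on decisions made by characters within a story already being told.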
The insurmountable difficulty here is to start with a model that doesn’t already degrade human values beyond hope of eventual alignment, which is what motivated exact imitation of humans (or WBEs) as the original form of this line of reasoning. The downside is that exact imitation doesn’t necessarily allow efficient computation of its long-term results without the choices of approximation in the predictions themselves depending on the values that this process is intended to compute (value-laden prediction).