For an agent (wrapper-mind), there is still a sharp distinction between goal and outcome. This is made more confusing by the fact that a plausible shape of a human-aligned goal is a living civilization (long reflection) figuring out what it wants to actually happen, and that living civilization looks like an outcome, but it isn't one. If its decision is that the outcome should look different from a continued computation of this goal, because something better is possible, then the civilization doesn't actually come into existence, and perhaps won't incarnate significant moral worth, except as a byproduct of having its decisions computed, which need to become known to be implemented. This is different from a wrapper-mind actively demolishing a civilization to build something else, since in this case the civilization never existed (and was never explicitly computed) in the first place. Though it might be impossible to make use of the decisions of a computation without explicitly computing it in a natural sense, which would force any goal-as-computation into being part of the outcome. The goal, unlike the rest of the outcome, is special in not being optimized according to the goal.
My current guess at a long reflection that is both robust (to errors in initial specification) in pursuing/preserving/extrapolating values and efficient to compute is a story simulation, what you arrive at by steelmanning a civilization that exists as a story told by GPT-n. When people are characters in many interacting stories generated by a language model, told in sufficient detail, they can still make decisions, as they are no more puppets to the AI than we are puppets to the laws of physics, provided the AI doesn't specifically intervene on the level of individual decisions. Formulation of values then involves the language model learning from the stories it has written, judging how desirable/instructive they are and shifting the directions in which new stories get written.
The insurmountable difficulty here is to start with a model that doesn't already degrade human values beyond hope of eventual alignment, which is what motivated exact imitation of humans (or WBEs) as the original form of this line of reasoning. The downside is that exact imitation doesn't necessarily allow efficient computation of its long-term results without the choices of approximation in the predictions themselves depending on the values that this process is intended to compute (value-laden prediction).