The point of the post is that these are strategically different kinds of value: the wrapper-mind goal and human values. Complexity in the case of humans is not evidence for the distinction; the standard position is that the stuff you are describing is the complexity of an extrapolated human wrapper-mind goal, not a different kind of value, just a paperclipper whose goals are much more detailed. From that point of view, the response to your post is “Huh?”, since it doesn’t engage the crux of the disagreement.
Expecting wrapper-minds as an appropriate notion of human value is the result of following selection theorem reasoning. Consistent decision-making seems to imply wrapper-minds, and furthermore there is a convergent drive towards their formation as mesa-optimizers under optimization pressure. It is therefore expected that AGIs become wrapper-minds in short order (or at least eventually) even if they are not immediately designed this way. If they are aligned, this is even a good thing, since wrapper-minds are best at achieving goals, including humanity’s goals. If aligned wrapper-minds are never built, that is astronomical waste: going about optimization of the future light cone in a monstrously inefficient manner. AI risk probably starts with AGIs that are not wrapper-minds, yet these arguments suggest that the eventual shape of the world is given by wrapper-minds borne of those AGIs, if they hold control of the future, and that the disagreement of those wrapper-minds with human values is going to steamroll the future with things human values won’t find agreeable. Unaligned wrapper-minds are going to be a disaster, and unaligned AGIs that are not wrapper-minds are still going to build/become such unaligned wrapper-minds.
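To spell out the coherence intuition behind “consistent decision-making seems to imply wrapper-minds”, here is a toy money-pump sketch; the preferences, fee, and trading loop are made up for illustration, and this is not any specific selection theorem:

```python
# Toy money pump: an agent with cyclic preferences (A > B > C > A) will pay
# a small fee for each "improving" swap and end up back where it started.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}    # (x, y) means the agent prefers x over y

def run_trades(holding, rounds, fee=1):
    """Offer the agent its preferred swap each round, charging a small fee."""
    swap_into = {"B": "A", "C": "B", "A": "C"}    # the option the agent will pay to get
    fees_paid = 0
    for _ in range(rounds):
        desired = swap_into[holding]
        if (desired, holding) in prefers:          # agent strictly prefers the swap, so accepts
            holding = desired
            fees_paid += fee
    return holding, fees_paid

print(run_trades("A", rounds=9))  # ('A', 9): three full cycles, nine fees poorer
```

An agent whose preferences cycle like this bleeds resources to anyone who offers trades, so the agents that survive such pressure are the ones whose behavior looks like consistent maximization of something.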
The crux of the disagreement is whether unaligned AGIs that are not wrapper-minds inevitably build/become unaligned wrapper-minds. The option where they never build any wrapper-minds is opposed by the astronomical waste argument, the opportunity cost of not making use of the universe in the most efficient way; this is not impossible, merely impossibly sad. The option where unaligned AGIs build/become aligned wrapper-minds requires some kind of miracle that doesn’t follow from the usual understanding of how extrapolation of value/long reflection works: a coincidence of AGIs with different values independently converging on the same or mutually agreeable goal for the wrapper-minds they would build if individually starting from a position of control, not having to compromise with others.
My guess: there’s a conflict between the mathematically desirable properties of an expected utility maximizer on the one hand, and the very undesirable behaviors of the AI safety culture’s most salient examples of expected utility maximizers on the other (e.g., a paperclip maximizer, a happiness maximizer, etc.).
People associate the badness of these simple-utility-function EU maximizers with the mathematical EU maximization framework itself. I think that “EU maximization for humans” looks like an optimal joint policy that reflects a negotiated equilibrium across our entire distribution over diverse values, not some sort of collapse into maximizing a narrow conception of what humans “really” want.
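As a toy illustration of the difference, with entirely made-up actions, value functions, and weights: collapsing onto a single narrow value picks that value’s favorite extreme, while maximizing expected utility over the whole distribution of values picks the compromise the distribution as a whole endorses.

```python
# Made-up toy example: one narrow value vs. an expectation over diverse values.
candidate_actions = ["fill_universe_with_X", "balanced_flourishing", "status_quo"]

# Several distinct values, each scoring every action in [0, 1] (numbers invented).
values = {
    "hedonic":   {"fill_universe_with_X": 1.0, "balanced_flourishing": 0.8, "status_quo": 0.3},
    "autonomy":  {"fill_universe_with_X": 0.0, "balanced_flourishing": 0.9, "status_quo": 0.6},
    "diversity": {"fill_universe_with_X": 0.1, "balanced_flourishing": 0.8, "status_quo": 0.5},
}
weights = {"hedonic": 0.4, "autonomy": 0.3, "diversity": 0.3}  # credence / bargaining weights

# Maximizing one narrow value picks its extreme...
narrow_best = max(candidate_actions, key=lambda a: values["hedonic"][a])

# ...while EU maximization over the whole distribution of values does not collapse that way.
joint_best = max(candidate_actions,
                 key=lambda a: sum(weights[v] * values[v][a] for v in values))

print(narrow_best)  # fill_universe_with_X
print(joint_best)   # balanced_flourishing
```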
I think of “wrapper mind bad” as referring to the intuitive notion of a simple EU maximizer / paperclipper, which is very bad. Arguing that “EU maximization good” is, I think, true, but it doesn’t quite get at the intuition behind “wrapper mind bad”.
Given how every natural goal-seeking agent seems to be built on layers and layers of complex interactions, I have to wonder if “utility” and “goals” are the wrong paradigms to use. Not that I have any better ones ready, mind.
The point is not that EU maximizers are always bad in principle, but that a utility function that won’t be bad is not something we can give to an AGI that acts as an EU maximizer: it won’t be merely more complicated than the simple utilities from the obviously bad examples, it must be seriously computationally intractable, given by very indirect pointers to value. And optimizing according to an intractable definition of utility is no longer EU maximization (in practice, where compute matters), so the framing stops being useful in that case.
The framing remains useful only for misaligned optimizers, or in unbounded-compute theory that doesn’t straightforwardly translate to practice.
If you need to represent some computationally intractable object, there are many tricks available to approximate such an object in a computationally efficient manner. E.g., one can split the intractable object into modular factors, then use only those factors which are most relevant to the current situation. My guess is that this is exactly what values are: modular, tractable factors that let us efficiently approximate a computationally intractable utility function.
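A minimal sketch of what I mean, with hypothetical factors, relevance tests, and situation features:

```python
# Instead of evaluating one intractable utility over entire world-histories,
# keep a library of tractable value-factors and consult only the ones
# relevant to the current situation. Everything here is a made-up stand-in.
factors = {
    "honesty": (lambda s: "communication" in s["features"], lambda s: -s["lies_told"]),
    "comfort": (lambda s: "physical" in s["features"],      lambda s: -s["pain_level"]),
    "novelty": (lambda s: "exploration" in s["features"],   lambda s: s["new_things_seen"]),
}

def approx_value(situation):
    """Approximate the intractable utility using only the locally relevant factors."""
    return sum(evaluate(situation)
               for is_relevant, evaluate in factors.values()
               if is_relevant(situation))

situation = {"features": {"communication", "exploration"}, "lies_told": 1, "new_things_seen": 3}
print(approx_value(situation))  # 2: honesty and novelty fire, comfort is ignored
```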
If you actually optimize according to an approximation, that’s going to Goodhart-curse the outcome. Any approximation must only be soft-optimized for, not EU-maximized. A design that seeks EU maximization, and merely hopes for soft optimization that doesn’t go too far, doesn’t pass the omnipotence test.
Also, an approximation worth even soft-optimizing for should be found in a value-laden way, losing only inessential details rather than anything highly value-relevant. Approximate knowledge of values helps with finding better approximations to values.
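A toy sketch of the difference, using a crude quantilizer as the soft optimizer (sample from the top fraction of plans instead of taking the argmax); the proxy and the “true value” curve are invented for illustration:

```python
import random

random.seed(0)

def proxy_score(plan):
    return plan  # the approximation rewards extremity without limit

def true_value(plan):
    return plan if plan <= 7 else 7 - 3 * (plan - 7)  # diverges from the proxy at the extremes

plans = list(range(11))  # candidate plans 0..10

hard_max = max(plans, key=proxy_score)              # argmax on the approximation: plan 10
top_fraction = sorted(plans, key=proxy_score)[-5:]  # top 50% of plans by the approximation
soft_choice = random.choice(top_fraction)           # soft-optimized (quantilized) choice

print(true_value(hard_max))     # -2: hard optimization lands exactly where the proxy breaks
print(true_value(soft_choice))  # one of the top-fraction plans, which average 3.2 true value
```

The argmax ends up precisely where the approximation stops tracking the thing it approximates, which is the Goodhart-curse worry; merely-good plans mostly avoid it, though choosing which approximation is worth even this much optimization is itself value-laden, as noted above.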
Although the selection/coherence theorems imply that a mature civilization should act like a wrapper-mind, I don’t think that the internals of such a civilization have to look wrapper-mind-like, which has potential relevance for what kind of AGI design we should aim for. A highly efficient civilization composed of human-mind-like AGIs collectively reaching some sort of bargaining equilibrium might ultimately be equivalent to the goal of some wrapper-mind, but directly building the wrapper-mind seems more dangerous, because small perturbations in its utility function can destroy all value, whereas intuitively it seems that perturbations in the initial design of the human-mind-like AGIs could still lead to a bargaining equilibrium that is valuable from our perspective. This reminds me a bit of how you can formulate classical physics as either locally following Newton’s laws or, equivalently, finding the action-minimizing path. In this case the two formulations are utilitarianism and contractarianism. The benefit of considering both formulations is that axiologies that are simple/robust in one might be highly unnatural in the other.
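For reference, the physics side of the analogy in its standard form: the local law and the global variational principle pick out the same trajectories for a particle in a potential V.

```latex
% Local formulation (Newton's second law) and global formulation
% (stationary action) describe the same classical trajectories.
\[
  m\,\ddot{x}(t) = -\nabla V\bigl(x(t)\bigr)
  \qquad\Longleftrightarrow\qquad
  \delta S[x] = 0,
  \quad
  S[x] = \int_{t_0}^{t_1} \Bigl( \tfrac{1}{2}\, m\,\dot{x}(t)^2 - V\bigl(x(t)\bigr) \Bigr)\, dt .
\]
```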
For an agent (wrapper-mind), there is still a sharp distinction between goal and outcome. This is made more confusing by the fact that a plausible shape of a human-aligned goal is a living civilization (long reflection) figuring out what it wants to actually happen, and that living civilization looks like an outcome, but it’s not. If its decision is that the outcome should look different from a continued computation of this goal, because something better is possible, then the civilization doesn’t actually come into existence, and perhaps won’t incarnate significant moral worth, except as a byproduct of having its decisions computed, which need to become known to be implemented. This is different from a wrapper-mind actively demolishing a civilization to build something else, since in this case the civilization never existed (and was never explicitly computed) in the first place. Though it might be impossible to make use of the decisions of a computation without explicitly computing it in a natural sense, which would force any goal-as-computation into being part of the outcome. The goal, unlike the rest of the outcome, is special in not being optimized according to the goal.
My current guess at a long reflection that is both robust (to errors in the initial specification) in pursuing/preserving/extrapolating values and efficient to compute is a story simulation: what you arrive at by steelmanning a civilization that exists as a story told by GPT-n. When people are characters in many interacting stories generated by a language model, told in sufficient detail, they can still make decisions, as they are no more puppets to the AI than we are puppets to the laws of physics, provided the AI doesn’t specifically intervene on the level of individual decisions. Formulation of values involves the language model learning from the stories it’s written, judging how desirable/instructive they are, and shifting the directions in which new stories get written.
The insurmountable difficulty here is to start with a model that doesn’t already degrade human values beyond hope of eventual alignment, which is what motivated exact imitation of humans (or WBEs) as the original form of this line of reasoning. The downside is that exact imitation doesn’t necessarily allow efficient computation of its long-term results without the choices of approximation in the predictions themselves depending on the values that this process is intended to compute (value-laden prediction).
Expecting wrapper-minds as an appropriate notion of human value is the result of following selection theorem reasoning. Consistent decision-making seems to imply wrapper-minds, and furthermore there is a convergent drive towards their formation as mesa-optimizers under optimization pressure. It is therefore expected that AGIs become wrapper-minds in short order (or at least eventually) even if they are not immediately designed this way.
As I currently understand your points, they seem like not much evidence at all towards the wrapper-mind conclusion.
Why are wrapper-minds an “appropriate notion” of human values, when AFAICT they seem diametrically opposite on many axes (e.g. time-varying, context-dependent)?
Why do you think consistent decision-making implies wrapper-minds?
What is “optimization pressure”, and where is it coming from? What is optimizing the policy networks? SGD? Are the policy networks supposed to be optimizing themselves to become wrapper-minds?
Why should we expect unitary mesa-optimizers with globally activated goals, when AFAICT we have never observed this, nor seen relatively large amounts of evidence for it? (I’d be excited to pin down a bet with you about policy internals and generalization.)
wrapper-minds are best at achieving goals, including humanity’s goals
Seems doubtful to me, insofar as we imagine wrapper-minds to be grader-optimizers which globally optimize the output of some utility function over all states/universe-histories/whatever, or of some EU function over all plans.
As I currently understand your points, they seem like not much evidence at all towards the wrapper-mind conclusion.
There are two wrapper-mind conclusions, and the purpose of my comment was to frame the distinction between them. The post seems to be conflating them in the context of AI risk, mostly talking about one of them while alluding to AI risk relevance that seems instead to mostly concern the other. I cited standard reasons for taking either of them seriously, in the forms that make conflating them easy. That doesn’t mean I accept the relevance of those reasons.
You can take a look at this comment for something about my own position on human values, which doesn’t seem relevant to this post or my comments here. Specifically, I agree that human values don’t have wrapper-mind character, whether as expressed in people or as likely to get expressed in sufficiently human-like AGIs, but I expect that it’s a good idea for humans or those AGIs to eventually build wrapper-minds to manage the universe (and this point seems much more relevant to AI risk). I’ve maintained this distinction for a while.
Complexity in the case of humans is not evidence for the distinction; the standard position is that the stuff you are describing is the complexity of an extrapolated human wrapper-mind goal, not a different kind of value, just a paperclipper whose goals are much more detailed. From that point of view, the response to your post is “Huh?”, since it doesn’t engage the crux of the disagreement.
“Huh?” was exactly my reaction. My values don’t vary depending on any environmental input; after all, they are the “ground truths” that give meaning to everything else.