To get an AI that wants to help humans, just ensure the AI is rewarded for helping humans during training.
To what extent do you expect this to generalize “correctly” outside of the training environment?
In your linked comment, you mention humans being averse to wireheading, but I think that’s only sort-of true: a lot of people who successfully avoid trying heroin because they don’t want to become heroin addicts, do still end up abusing a lot of other evolutionarily-novel superstimuli, like candy, pornography, and video games.
That makes me think inner-misalignment is still going to be a problem when you scale to superintelligence: maybe we evolve an AI “species” that’s genuinely helpful to us in the roughly human-level regime (where its notion of helping and our notion of being-helped coincide very well), but when the AIs become more powerful than us, they mostly discard the original humans in favor of optimized AI-“helping”-“human” superstimuli.
I guess I could imagine this being an okay future if we happened to get lucky about how robust the generalization turned out to be—maybe the optimized AI-“helping”-“human” superstimuli actually are living good transhuman lives, rather than being a nonsentient “sex toy” that happens to be formed in our image? But I’d really rather not bet the universe on this (if I had the choice not to bet).
Do you know if there’s any research relevant to whether “degree of vulnerability to superstimuli” is correlated with intelligence in humans?
One aspect of inner alignment failures that I think is key to safe generalizations is that values tend to multiply. E.g., the human reward system is an inner alignment failure wrt evolution’s single “value”. Human values are inner alignment failures wrt the reward system. Each step we’ve seen has brought a significant increase in the breadth / diversity of values (admittedly, we’ve only seen two steps, but IMO it also makes sense that the process of inner alignment failure is oriented towards value diversification).
If even a relatively small fraction of the AI’s values orient towards actually helping humans, I think that’s enough to avert the worst possible futures. From that point, it becomes a matter of ensuring that values are able to perpetuate themselves robustly (currently a major focus of our work on this perspective; prospects seem surprisingly good, but far from certain).
maybe the optimized AI-“helping”-“human” superstimuli actually are living good transhuman lives, rather than being a nonsentient “sex toy” that happens to be formed in our image?
I actually think it would be very likely that such superstimuli are sentient. Humans are sentient. If you look at non-sentient humans (sleeping, sleepwalking, in a trance state, under some anesthetic drugs, etc.), they typically behave quite differently from normal humans.
If you look at non-sentient humans (sleeping, sleepwalking, in a trance state, under some anesthetic drugs, etc.), they typically behave quite differently from normal humans.
Yeah, in the training environment. But, as you know, the reason people think inner-misalignment is a problem is precisely because capability gains can unlock exotic new out-of-distribution possibilities that don’t have the same properties.
Boring, old example (skip this paragraph if it’s too boring): humans evolved to value sweetness as an indicator of precious calories, and then we invented aspartame, which is much sweeter for much fewer calories. Someone in the past who reasoned, “If you look at sweet foods, they have a lot of calories; that’ll probably be true in the future”, would have been meaningfully wrong. (We still use actual sugar most of the time, but I think this is a lot like why we still have rainforests: in the limit of arbitrary capabilities, we don’t care about any of the details of “original” sugar except what it tastes like to us.)
Better, more topical example: human artists who create beautiful illustrations on demand experience a certain pride in craftsmanship. Does DALL-E? Notwithstanding whether “it may be that today’s large neural networks are slightly conscious”, I’m going to guess No, there’s nothing in text-to-image models remotely like a human artist’s pride; we figured out how to get the same end result (beautiful art on demand) in an alien, inhuman way that’s not very much like a human artist internally. Someone in the past who reasoned, “The creators of beautiful art will take pride in their craft,” would be wrong.
key to safe generalizations is that values tend to multiply [...] significant increase in the breadth / diversity of values
“Increase in diversity” and “safe generalization” seem like really different things to me? What if some of the new, diverse values are actually bad from our perspective? (Something like, being forced to smile might make you actually unhappy despite the outward appearance of your face, but a human-smile-maximizer doesn’t care about that, and this future is more diverse than the present because the present doesn’t have any smile-maximizers.)
Basically, some of your comments make me worry that you’re suffering from a bit of anthropomorphic optimism?
At the same time, however, I think this line of research is very interesting and I’m excited to see where you go with it! Yudkowsky tends to do this lame thing where after explaining the inner-alignment/context-disaster problem, he skips to, “And therefore, because there’s no obvious relationship between the outer loss function and learned values, and because the space of possibilities is so large, almost all of it has no value, like paperclips.” I think there’s a lot of missing argumentation there, and discovering the correct arguments could change the conclusion and our decisions a lot! (In the standard metaphor, we’re not really in the position of “evolution” with respect to AI so much as we are the environment of evolutionary adaptedness.) It’s just, we need to be careful to be asking, “Okay, what actually happens with inner alignment failures; what’s the actual outcome specifically?” without trying to “force” that search into finding reassuring fake reasons why the future is actually OK.
Basically, some of your comments make me worry that you’re suffering from a bit of anthropomorphic optimism?
Ironically, one of my medium-sized issues with mainline alignment thinking is that it seems to underweight the evidence we get from observing humans and human values. The human brain is, by far, the most general and agentic learning system in current existence. We also have ~7 billion examples of human value learning to observe. The data they provide should strongly inform our intuitions on how other highly general and agentic learning systems behave. When you have limited evidence about a domain, what little evidence you do have should strongly inform your intuitions.
In fact, our observations of humans should inform our expectations of AGIs much more strongly than the above argument implies because we are going to train those AGIs on data generated by humans. It’s well known in deep learning that training data are usually more important than details of the learning process or architecture.
I think alignment thinking has an inappropriately strong bias against anchoring expectations to our observations of humans. There’s an assumption that the human learning algorithm is in some way “unnatural” among the space of general and effective learning algorithms, and that we therefore can’t draw inferences about AGIs based on our observations of humans. E.g., Eliezer Yudkowsky’s post My Childhood Role Model:
Humans are adapted to chase deer across the savanna, throw spears into them, cook them, and then—this is probably the part that takes most of the brains—cleverly argue that they deserve to receive a larger share of the meat.
It’s amazing that Albert Einstein managed to repurpose a brain like that for the task of doing physics. This deserves applause. It deserves more than applause, it deserves a place in the Guinness Book of Records. Like successfully building the fastest car ever to be made entirely out of Jello.
How poorly did the blind idiot god (evolution) really design the human brain?
This is something that can only be grasped through much study of cognitive science, until the full horror begins to dawn upon you.
All the biases we have discussed here should at least be a hint.
Likewise the fact that the human brain must use its full power and concentration, with trillions of synapses firing, to multiply out two three-digit numbers without a paper and pencil.
Yudkowsky notes that a learning algorithm hyper-specialized to the ancestral environment would not generalize well to thinking about non-ancestral domains like physics. This is absolutely correct; and since human brains demonstrably do generalize to domains like physics, it represents a significant misprediction by any view that assigns a high degree of specialization to the brain’s learning algorithm.
In fact, large language models arguably implement social instincts with more adroitness than many humans possess. However, original physics research in the style of Einstein remains well out of reach. This is exactly the opposite of what you should predict if you believe that evolution hard coded most of the brain to “cleverly argue that they deserve to receive a larger share of the meat”.
Yudkowsky brings up multiplication as an example of a task that humans perform poorly at, supposedly because brains specialized to the ancestral environment had no need for such capabilities. And yet, GPT-3 is also terrible at multiplication (even after accounting for the BPE issue), and no part of its architecture or training procedure is at all specialized for the human ancestral environment.
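To make the tokenization half of that “BPE issue” concrete, here’s a minimal sketch, assuming the tiktoken package is available (the exact chunking depends on the tokenizer, but the qualitative point holds for GPT-2/GPT-3-style BPE vocabularies):

```python
# Minimal sketch: how a GPT-2/GPT-3-style BPE vocabulary chunks multi-digit numbers.
# Assumes the `tiktoken` package is installed; any BPE tokenizer shows the same effect.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["347 * 982 =", "123456 * 654321 ="]:
    pieces = [enc.decode([tok]) for tok in enc.encode(text)]
    print(text, "->", pieces)

# Long numbers come out as irregular multi-digit chunks rather than single digits,
# so the model never gets a stable digit-level representation to do arithmetic over.
```

(This is only the tokenization side; as noted above, models still do poorly at multi-digit multiplication even after accounting for it.)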
“Increase in diversity” and “safe generalization” seem like really different things to me? What if some of the new, diverse values are actually bad from our perspective? (Something like, being forced to smile might make you actually unhappy despite the outward appearance of your face, but a human-smile-maximizer doesn’t care about that, and this future is more diverse than the present because the present doesn’t have any smile-maximizers.)
But a smile maximizer would have less diverse values than a human? It only cares about smiles, after all. When you say “smile maximizer”, is that shorthand for a system with a broad distribution over different values, one of which happens to be smile maximization? That’s closer to how I think of things, with the system’s high level behavior arising as a sort of negotiated agreement between its various values.
IMO, systems with broader distributions over values are more likely to assign at least some weight to things like “make people actually happy” and to other values that we don’t even know we should have included. In that case, the “make people actually happy” value and the “smile maximization” value can cooperate and make people smile by being happy (and also cooperate with the various other values the system develops). That’s the source of my intuition that a broader distribution over values is actually safer: it makes you less likely to miss something important.
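Here’s a toy sketch of that “negotiated agreement” intuition; all of the plans, value names, and weights below are made up for the example:

```python
# Toy sketch (hypothetical plans, values, and weights): high-level behavior as a
# weighted negotiation between whatever values the system happens to hold.

candidate_plans = {
    "force smiles": {"smiles": 1.0, "actual_happiness": -1.0, "autonomy": -1.0},
    "make people happy (they smile as a side effect)":
        {"smiles": 0.9, "actual_happiness": 1.0, "autonomy": 0.5},
}

def plan_score(plan_features, value_weights):
    # Each value "votes" in proportion to its weight; the plan with the highest
    # total wins the negotiation.
    return sum(w * plan_features.get(v, 0.0) for v, w in value_weights.items())

narrow_values = {"smiles": 1.0}
broad_values = {"smiles": 1.0, "actual_happiness": 0.3, "autonomy": 0.2}

for name, features in candidate_plans.items():
    print(f"{name}: narrow={plan_score(features, narrow_values):.2f}, "
          f"broad={plan_score(features, broad_values):.2f}")

# Under the narrow value set, "force smiles" edges out the alternative (1.0 vs 0.9);
# give even small weight to "actual_happiness" and "autonomy" and it loses (0.5 vs 1.3).
```

The specific numbers don’t matter; the point is that a value only needs nonzero weight in the mixture to pull the negotiation away from the degenerate plan.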
More generally, I think that a lot of the alignment intuition that “values are fragile” actually comes from a pretty simple type error. Consider:
The computation a system executes depends on its inputs. If you have some distribution over possible inputs, that translates to having a distribution over possible computations.
“Values” is just a label we apply to particular components of a system’s computations.
If a system has a situation-dependent distribution over possible computations, and values are implemented by those computations, then the system also has a situation-dependent distribution over possible values.
However, people can only consciously instantiate a small subset of discrete values at any given time. There thus appears to be a contrast between “the values we can imagine” and “the values we actually have”. Trying to list out a discrete set of “true human values” roughly corresponds to trying to represent a continuous distribution with a small set of discrete samples from that distribution (this is the type error in question). It doesn’t help that the distribution over values is situation-dependent, so any sampling of their values a human performs in one situation may not transfer to the samples they’d take in another situation.
Given the above, it should be no surprise that our values feel “fragile” when we introspect on them.
Preempting a possible confusion: the above treats a “value” and “the computation that implements that value” interchangeably. If you’re thinking of a “value” as something like a principal component of an agent’s utility function, somehow kept separate from the system that actually implements those values, then this might seem counterintuitive.
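A toy numerical illustration of the type error above, with made-up value names and weights:

```python
# Toy sketch (hypothetical values and weights): a situation-dependent distribution
# over values, vs. the short discrete list you get by introspecting in one situation.
import numpy as np

rng = np.random.default_rng(0)
value_names = ["kindness", "honesty", "curiosity", "loyalty", "fairness", "novelty"]

# Each row is one situation; each situation activates the values with different weights.
situation_weights = rng.dirichlet(np.ones(len(value_names)), size=4)

for i, weights in enumerate(situation_weights):
    salient = [value_names[j] for j in np.argsort(weights)[::-1][:2]]
    print(f"situation {i}: values you'd list if asked here -> {salient}")

marginal = situation_weights.mean(axis=0)
print("marginal weight on each value:", dict(zip(value_names, marginal.round(2))))

# The two values salient in any single situation are a lossy sample of the whole
# distribution -- which is why an explicit list of "true values" feels so fragile.
```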
Under this framing, questions like whether the physical rainforests get destroyed, or whether other things we might value are preserved, are mainly about ensuring that a broad distribution of worthwhile values can perpetuate themselves across time and influence the world to at least some degree. “Preserving X”, for any value of X, is about ensuring that the system has at least some values oriented towards preserving X, that those values can persist over time, and that those values can actually ensure that X is preserved. (And the broader the values, the more different Xs we can preserve.)
I think the prospects for achieving those three things are pretty good, though I don’t think I’m ready to write up my full case for believing so.
(I do admit that it’s possible to have a system that ends up pursuing a simple / “dumb” goal, such as maximizing paperclips, to the exclusion of all else. That can happen when the system’s distribution over possible values places so much weight on paperclip-adjacent values that they can always override any other values. This is another reason I’m in favor of broad distributions over values.)
Yudkowsky tends to do this lame thing where after explaining the inner-alignment/context-disaster problem, he skips to, “And therefore, because there’s no obvious relationship between the outer loss function and learned values, and because the space of possibilities is so large, almost all of it has no value, like paperclips.”
Agreed. It’s particularly annoying because, IMO, there is a strong candidate for “obvious relationship between the outer loss function and learned values”: learned values reflect the distribution over past computations that achieved high reward on the various shallow proxies of the outer loss function that the model encountered during training.
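To make the shape of that claim concrete, here’s a cartoon sketch; the proxies and correlations are entirely made up:

```python
# Cartoon sketch (hypothetical proxies, made-up correlations): learned "values" as
# whatever internal computations got reinforced for firing alongside reward.
import random
random.seed(0)

# Shallow proxies that correlated (imperfectly) with the outer "was helpful" signal.
proxy_reward_correlation = {
    "human_smiled": 0.7,
    "button_pressed": 0.9,
    "said_thank_you": 0.5,
    "actually_helped": 0.8,
}
value_strength = {p: 0.0 for p in proxy_reward_correlation}

for _ in range(10_000):
    for proxy, corr in proxy_reward_correlation.items():
        fired = random.random() < 0.5          # did this computation run this episode?
        rewarded = fired and random.random() < corr
        if rewarded:                           # crude credit assignment
            value_strength[proxy] += 1.0

print({p: round(s) for p, s in value_strength.items()})
# The resulting "values" are spread across every proxy in proportion to how often it
# co-occurred with reward -- not a copy of the outer loss, and not a single goal.
```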
In fact, large language models arguably implement social instincts with more adroitness than many humans possess.
Large language models implement social behavior as expressed in text. I don’t want to call that social “instincts”, because the implementation, and out-of-distribution behavior, is surely going to be very different.
But a smile maximizer would have less diverse values than a human? It only cares about smiles, after all. When you say “smile maximizer”, is that shorthand for a system with a broad distribution over different values, one of which happens to be smile maximization?
A future with humans and smile-maximizers is more diverse than a future with just humans. (But, yes, “smile maximizer” here is our standard probably-unrealistic stock example standing in for inner alignment failures in general.)
That’s the source of my intuition that a broader distribution over values is actually safer: it makes you less likely to miss something important.
Trying again: the reason I don’t want to call that “safety” is because, even if you’re less likely to completely miss something important, you’re more likely to accidentally incorporate something you actively don’t want.
If we start out with a system where I press the reward button when the AI makes me happy, and it scales and generalizes into a diverse coalition of a smile-maximizer, and a number-of-times-the-human-presses-the-reward-button-maximizer, and an amount-of-dopamine-in-the-human’s-brain-maximizer, plus a dozen or a thousand other things … okay, I could maybe believe that some fraction of the cosmic endowment gets used in ways that I approve of.
But what if most of it is things like … copies of me with my hand wired up to hit the reward button ten times a second, my face frozen into a permanent grin, while drugged up on a substance that increases serotonin levels, which correlated with “happiness” in the training environment, but when optimized subjectively amounts to an unpleasant form of manic insanity? That is, what if some parts of “diverse” are Actually Bad? I would really rather not roll the dice on this, if I had the choice!