The vast majority of humans would not destroy the world [...] This is in stark contrast to how AIs are assumed to destroy the world by default.
Most humans would (and do) seek power and resources in a way that is bad for other systems that happen to be in the way (e.g., rainforests). When we colloquially talk about AIs “destroying the world” by default, it’s a very self-centered summary: the world isn’t actually “destroyed”, just radically transformed in a way that doesn’t end with any of the existing humans being alive, much like how our civilization transforms the Earth in ways that cut down existing forests.
You might reply: but wild nature still exists; we don’t cut down all the forests! True, but an important question here is to what extent is that due to “actual” environmentalist/conservationist preferences in humans, and to what extent is it just that we “didn’t get around to it yet” at our current capability levels?
In today’s world, people who care about forest animals, and people who enjoy the experience of being in a forest, both have an interest in protecting forests. In the limit of arbitrarily advanced technology, this is less obvious: it’s probably more efficient to turn everything into an optimal computing substrate, and just simulate happy forest animals for the animal lovers and optimal forest scenery for the scenery-lovers. Any fine details of the original forest that the humans don’t care about (e.g., the internals of plants) would be lost.
The vast majority of humans would not destroy the world, even given unlimited power to enact their preferences unopposed.
I was specifically talking about the preferences of an individual human. The behavior of the economic systems that derive from the actions of many humans need not be aligned with the preferences of any component part of said systems. For AIs, we’re currently interested in the values that arise in a single AI (specifically, the first AI capable of a hard takeoff), so single humans are the more appropriate reference class.
the world isn’t actually “destroyed”, just radically transformed in a way that doesn’t end with any of the existing humans being alive
In fact, “radically transformed in a way that doesn’t end with any of the existing humans being alive” is what I meant by “destroyed”. That’s the thing that very few current humans would do, given sufficient power. That’s the thing that we’re concerned that future AIs might do, given sufficient power. You might have a different definition of the word “destroyed”, but I’m not using that definition.
I believe that there are plenty of people who would destroy the world. I do know at least one personally. I don’t know very many people to the extent that I could even hazard a guess as to whether they actually would or not, so either I am very fortunate (!) to know one of this tiny number, or there are at least millions of them and possibly hundreds of millions.
I am pretty certain that most humans would destroy the world if there was any conflict between that and any of their strongest values. The world persists only because there are no gods. The most powerful people to ever have existed have been powerful only because of the power granted to them by other humans. Remove that limitation and grant absolute power and superhuman intelligence along with capacity for further self-modification to a single person, and I give far better than even odds that what results is utterly catastrophic.
Let’s suppose there are ~300 million people who’d use their unlimited power to destroy the world (I think the true number is far smaller). That would mean > 95% of people wouldn’t do so. Suppose there were an alignment scheme that we’d tested billions of times on human-level AGIs, and > 95% of the time, it resulted in values compatible with humanity’s continued survival. I think that would be a pretty promising scheme.
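To make the arithmetic behind the “> 95%” explicit (taking a rough world population of 8 billion; both figures are just this thread’s ballpark estimates, not data):

```python
# Ballpark check of the "> 95%" claim; both numbers are the thread's
# own rough estimates, not real data.
world_population = 8_000_000_000
would_destroy = 300_000_000
fraction_would_not = 1 - would_destroy / world_population
print(fraction_would_not)  # 0.9625, i.e. more than 95%
```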
grant absolute power and superhuman intelligence along with capacity for further self-modification to a single person, and I give far better than even odds that what results is utterly catastrophic.
If there were a process that predictably resulted in me having values strongly contrary to those I currently possess, I wouldn’t do it. The vast majority of people won’t take pills that turn them into murderers. For the same reason, an aligned AI at slightly superhuman capability levels won’t self-modify without first becoming confident that its self-modification will preserve its values. Most likely, it would instead develop better alignment tech than we used to create said AI and create a more powerful aligned successor.
I think that a 95% success rate in not destroying the human world would also be fantastic, though I note that there are plenty more potential totalitarian hellscapes that some people would apparently rate even worse than extinction.
Note that I’m not saying that they would deliberately destroy the world for shits and giggles, just that if the rest of the human world was any impediment to anything they valued more, then its destruction would just be a side effect of what had to be done.
I also don’t have any illusion that a superintelligent agent will be infallible. The laws of the universe are not kind, and great power brings the opportunity for causing great disasters. I fully expect that any super-civilizational entity of any level of intelligence could very well destroy the human world by mistake.
“radically transformed in a way that doesn’t end with any of the existing humans being alive” is what I meant by “destroyed”
Great, we’re on the same page.
That’s the thing that very few current humans would do, given sufficient power. That’s the thing that we’re concerned that future AIs might do, given sufficient power.
I think I’m expressing skepticism that inner-misaligned adaptations in simple learning algorithms are enough to license using current humans as a reference class quite this casually?
The “traditional” Yudkowskian position says, “Just think of AI as something that computes plans that achieve outcomes; logically, a paperclip maximizer is going to eat you and use your atoms to make paperclips.” I read you as saying that AIs trained using anything like current-day machine learning techniques aren’t going to be pure consequentialists like that; they’ll have a mess of inner-misaligned “adaptations” and “instincts”, like us. I agree that this is plausible, but I think it suggests “AI will be like another evolved species” rather than “AI will be like humans” as our best current-world analogy, and the logic of “different preferences + more power = genocide” still seems likely to apply across a gap that large (even if it’s smaller than the gap to a pure consequentialist)?
...I think it suggests “AI will be like another evolved species” rather than “AI will be like humans”...
This was close to my initial assumption as well. I’ve since spent a lot of time thinking about the dynamics that arise from inner alignment failures in a human-like learning system, essentially trying to apply microeconomics to the internal “economy” of optimization demons that would result from an inner alignment failure. You can see this comment for some preliminary thoughts along these lines. A startling fraction of our deepest morality-related intuitions seem to derive pretty naturally / robustly from the multi-agent incentives associated with an inner alignment failure.
Moreover, I think that there may be a pretty straightforward relationship between a learning system’s reward function and the actual values it develops: values are self-perpetuating, context-dependent strategies that obtained high reward during training. If you want to ensure a learning system develops a given value, it may simply be enough to ensure that the system is rewarded for implementing the associated strategy during training. To get an AI that wants to help humans, just ensure the AI is rewarded for helping humans during training.
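A minimal sketch of that claim, with everything hypothetical (a tabular toy learner and made-up context/action names, not a model of real AI training): the only strategies that persist as “values” are the context-dependent ones the reward function actually paid out on.

```python
import random
from collections import defaultdict

# Toy sketch of "values as reinforced, context-dependent strategies".
# All names here (contexts, actions) are hypothetical illustrations.

class ToyLearner:
    def __init__(self, actions, epsilon=0.1):
        self.actions = actions
        self.epsilon = epsilon  # exploration rate
        # pref[context][action]: accumulated reward for that strategy
        self.pref = defaultdict(lambda: defaultdict(float))

    def act(self, context, explore=True):
        if explore and random.random() < self.epsilon:
            return random.choice(self.actions)
        prefs = self.pref[context]
        # Greedy over accumulated reward (unseen actions score 0.0).
        return max(self.actions, key=lambda a: prefs[a])

    def update(self, context, action, reward):
        self.pref[context][action] += reward

random.seed(0)
agent = ToyLearner(["help_human", "ignore_human"])
for _ in range(200):
    a = agent.act("human_needs_help")
    # The reward function pays out only for actually helping.
    agent.update("human_needs_help", a, 1.0 if a == "help_human" else 0.0)

# After training, the rewarded strategy persists as the "value":
print(agent.act("human_needs_help", explore=False))  # help_human
```

The sketch only illustrates the shape of the claim: whichever context-action pairs the reward function paid out on get baked in, which is also why reward on shallow proxies would bake in proxy-values instead.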
To get an AI that wants to help humans, just ensure the AI is rewarded for helping humans during training.
To what extent do you expect this to generalize “correctly” outside of the training environment?
In your linked comment, you mention humans being averse to wireheading, but I think that’s only sort-of true: a lot of people who successfully avoid trying heroin because they don’t want to become heroin addicts do still end up abusing a lot of other evolutionarily-novel superstimuli, like candy, pornography, and video games.
That makes me think inner-misalignment is still going to be a problem when you scale to superintelligence: maybe we evolve an AI “species” that’s genuinely helpful to us in the roughly human-level regime (where its notion of helping and our notion of being-helped, coincide very well), but when the AIs become more powerful than us, they mostly discard the original humans in favor of optimized AI-”helping”-”human” superstimuli.
I guess I could imagine this being an okay future if we happened to get lucky about how robust the generalization turned out to be—maybe the optimized AI-”helping”-”human” superstimuli actually are living good transhuman lives, rather than being a nonsentient “sex toy” that happens to be formed in our image? But I’d really rather not bet the universe on this (if I had the choice not to bet).
Do you know if there’s any research relevant to whether “degree of vulnerability to superstimuli” is correlated with intelligence in humans?
One aspect of inner alignment failures that I think is key to safe generalizations is that values tend to multiply. E.g., the human reward system is an inner alignment failure wrt evolution’s single “value”. Human values are inner alignment failures wrt the reward system. Each step we’ve seen has a significant increase in the breadth / diversity of values (admittedly, we’ve only seen two steps, but IMO it also makes sense that the process of inner alignment failure is oriented towards value diversification).
If even a relatively small fraction of the AI’s values orient towards actually helping humans, I think that’s enough to avert the worst possible futures. From that point, it becomes a matter of ensuring that values are able to perpetuate themselves robustly (currently a major focus of our work on this perspective; prospects seem surprisingly good, but far from certain).
maybe the optimized AI-”helping”-”human” superstimuli actually are living good transhuman lives, rather than being a nonsentient “sex toy” that happens to be formed in our image?
I actually think it would be very likely that such superstimuli are sentient. Humans are sentient. If you look at non-sentient humans (sleeping, sleepwalking, trance states, some anesthetic drugs, etc.), they typically behave quite differently from normal humans.
If you look at non-sentient humans (sleeping, sleepwalking, trance states, some anesthetic drugs, etc.), they typically behave quite differently from normal humans.
Yeah, in the training environment. But, as you know, the reason people think inner-misalignment is a problem is precisely because capability gains can unlock exotic new out-of-distribution possibilities that don’t have the same properties.
Boring, old example (skip this paragraph if it’s too boring): humans evolved to value sweetness as an indicator of precious calories, and then we invented aspartame, which is much sweeter for much fewer calories. Someone in the past who reasoned, “If you look at sweet foods, they have a lot of calories; that’ll probably be true in the future”, would have been meaningfully wrong. (We still use actual sugar most of the time, but I think this is a lot like why we still have rainforests: in the limit of arbitrary capabilities, we don’t care about any of the details of “original” sugar except what it tastes like to us.)
Better, more topical example: human artists who create beautiful illustrations on demand experience a certain pride in craftsmanship. Does DALL-E? Notwithstanding whether “it may be that today’s large neural networks are slightly conscious”, I’m going to guess No, there’s nothing in text-to-image models remotely like a human artist’s pride; we figured out how to get the same end result (beautiful art on demand) in an alien, inhuman way that’s not very much like a human artist internally. Someone in the past who reasoned, “The creators of beautiful art will take pride in their craft,” would be wrong.
key to safe generalizations is that values tend to multiply [...] significant increase in the breadth / diversity of values
“Increase in diversity” and “safe generalization” seem like really different things to me? What if some of the new, diverse values are actually bad from our perspective? (Something like, being forced to smile might make you actually unhappy despite the outward appearance of your face, but a human-smile-maximizer doesn’t care about that, and this future is more diverse than the present because the present doesn’t have any smile-maximizers.)
Basically, some of your comments make me worry that you’re suffering from a bit of anthropomorphic optimism?
At the same time, however, I think this line of research is very interesting and I’m excited to see where you go with it! Yudkowsky tends to do this lame thing where after explaining the inner-alignment/context-disaster problem, he skips to, “And therefore, because there’s no obvious relationship between the outer loss function and learned values, and because the space of possibilities is so large, almost all of it has no value, like paperclips.” I think there’s a lot of missing argumentation there, and discovering the correct arguments could change the conclusion and our decisions a lot! (In the standard metaphor, we’re not really in the position of “evolution” with respect to AI so much as we are the environment of evolutionary adaptedness.) It’s just, we need to be careful to be asking, “Okay, what actually happens with inner alignment failures; what’s the actual outcome specifically?” without trying to “force” that search into finding reassuring fake reasons why the future is actually OK.
Basically, some of your comments make me worry that you’re suffering from a bit of anthropomorphic optimism?
Ironically, one of my medium-sized issues with mainline alignment thinking is that it seems to underweight the evidence we get from observing humans and human values. The human brain is, by far, the most general and agentic learning system in current existence. We also have ~7 billion examples of human value learning to observe. The data they provide should strongly inform our intuitions on how other highly general and agentic learning systems behave. When you have limited evidence about a domain, what little evidence you do have should strongly inform your intuitions.
In fact, our observations of humans should inform our expectations of AGIs much more strongly than the above argument implies because we are going to train those AGIs on data generated by humans. It’s well known in deep learning that training data are usually more important than details of the learning process or architecture.
I think alignment thinking has an inappropriately strong bias against anchoring expectations to our observations of humans. There’s an assumption that the human learning algorithm is in some way “unnatural” among the space of general and effective learning algorithms, and that we therefore can’t draw inferences about AGIs based on our observations of humans. E.g., Eliezer Yudkowsky’s post My Childhood Role Model:
Humans are adapted to chase deer across the savanna, throw spears into them, cook them, and then—this is probably the part that takes most of the brains—cleverly argue that they deserve to receive a larger share of the meat.
It’s amazing that Albert Einstein managed to repurpose a brain like that for the task of doing physics. This deserves applause. It deserves more than applause, it deserves a place in the Guinness Book of Records. Like successfully building the fastest car ever to be made entirely out of Jello.
How poorly did the blind idiot god (evolution) really design the human brain?
This is something that can only be grasped through much study of cognitive science, until the full horror begins to dawn upon you.
All the biases we have discussed here should at least be a hint.
Likewise the fact that the human brain must use its full power and concentration, with trillions of synapses firing, to multiply out two three-digit numbers without a paper and pencil.
Yudkowsky notes that a learning algorithm hyper-specialized to the ancestral environment would not generalize well to thinking about non-ancestral domains like physics. This is absolutely correct. And since human brains demonstrably do generalize to physics, it represents a significant misprediction of any view that assigns a high degree of specialization to the brain’s learning algorithm.
In fact, large language models arguably implement social instincts with more adroitness than many humans possess. However, original physics research in the style of Einstein remains well out of reach. This is exactly the opposite of what you should predict if you believe that evolution hard-coded most of the brain to “cleverly argue that they deserve to receive a larger share of the meat”.
Yudkowsky brings up multiplication as an example of a task that humans perform poorly at, supposedly because brains specialized to the ancestral environment had no need for such capabilities. And yet, GPT-3 is also terrible at multiplication (even after accounting for the BPE issue), and no part of its architecture or training procedure is at all specialized for the human ancestral environment.
“Increase in diversity” and “safe generalization” seem like really different things to me? What if some of the new, diverse values are actually bad from our perspective? (Something like, being forced to smile might make you actually unhappy despite the outward appearance of your face, but a human-smile-maximizer doesn’t care about that, and this future is more diverse than the present because the present doesn’t have any smile-maximizers.)
But a smile maximizer would have less diverse values than a human? It only cares about smiles, after all. When you say “smile maximizer”, is that shorthand for a system with a broad distribution over different values, one of which happens to be smile maximization? That’s closer to how I think of things, with the system’s high-level behavior arising as a sort of negotiated agreement between its various values.
IMO, systems with broader distributions over values are more likely to assign at least some weight to things like “make people actually happy” and to other values that we don’t even know we should have included. In that case, the “make people actually happy” value and the “smile maximization” value can cooperate and make people smile by being happy (and also cooperate with the various other values the system develops). That’s the source of my intuition that broader distributions over values are actually safer: it makes you less likely to miss something important.
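As a toy version of that negotiation intuition (all value names, weights, and scores here are made-up numbers, purely to show the coalition dynamics):

```python
# Toy sketch: high-level behavior as a negotiated agreement between
# learned values. Each value scores candidate outcomes; the system
# picks the outcome the weighted coalition rates highest. All names
# and numbers are hypothetical.

def choose(outcomes, value_weights, scores):
    def coalition_score(outcome):
        return sum(w * scores[value][outcome]
                   for value, w in value_weights.items())
    return max(outcomes, key=coalition_score)

outcomes = ["forced_smile", "genuine_happiness"]

# A narrow system: essentially all weight on smile-maximization.
narrow = {"maximize_smiles": 1.0}

# A broad system: smile-maximization is one value among several.
broad = {"maximize_smiles": 0.4,
         "make_people_actually_happy": 0.4,
         "respect_autonomy": 0.2}

# How each value scores each outcome (made-up numbers).
scores = {
    "maximize_smiles":            {"forced_smile": 1.0, "genuine_happiness": 0.9},
    "make_people_actually_happy": {"forced_smile": 0.0, "genuine_happiness": 1.0},
    "respect_autonomy":           {"forced_smile": 0.0, "genuine_happiness": 1.0},
}

print(choose(outcomes, narrow, scores))  # forced_smile
print(choose(outcomes, broad, scores))   # genuine_happiness
```

The design point being illustrated: the narrow coalition picks the proxy outcome, while even modest weight on “make people actually happy” is enough to tip the negotiated choice toward the outcome humans would endorse.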
More generally, I think that a lot of the alignment intuition that “values are fragile” actually comes from a pretty simple type error. Consider:
The computation a system executes depends on its inputs. If you have some distribution over possible inputs, that translates to having a distribution over possible computations.
“Values” is just a label we apply to particular components of a system’s computations.
If a system has a situation-dependent distribution over possible computations, and values are implemented by those computations, then the system also has a situation-dependent distribution over possible values.
However, people can only consciously instantiate a small subset of discrete values at any given time. There thus appears to be a contrast between “the values we can imagine” and “the values we actually have”. Trying to list out a discrete set of “true human values” roughly corresponds to trying to represent a continuous distribution with a small set of discrete samples from that distribution (this is the type error in question). It doesn’t help that the distribution over values is situation-dependent, so any sampling of their values a human performs in one situation may not transfer to the samples they’d take in another situation.
Given the above, it should be no surprise that our values feel “fragile” when we introspect on them.
Preempting a possible confusion: the above treats a “value” and “the computation that implements that value” interchangeably. If you’re thinking of a “value” as something like a principle component of an agent’s utility function, somehow kept separate from the system that actually implements those values, then this might seem counterintuitive.
Under this framing, questions like the destruction of the physical rainforests, or other things we might value, are mainly about ensuring that a broad distribution of worthwhile values can perpetuate themselves across time and can influence the world to at least some degree. “Preserving X”, for any value of X, is about ensuring that the system has at least some values oriented towards preserving X, that those values can persist over time, and that those values can actually ensure that X is preserved. (And the broader the values, the more different Xs we can preserve.)
I think the prospects for achieving those three things are pretty good, though I don’t think I’m ready to write up my full case for believing such.
(I do admit that it’s possible to have a system that ends up pursuing a simple / “dumb” goal, such as maximizing paperclips, to the exclusion of all else. That can happen when the system’s distribution over possible values places so much weight on paperclip-adjacent values that they can always override any other values. This is another reason I’m in favor of broad distributions over values.)
Yudkowsky tends to do this lame thing where after explaining the inner-alignment/context-disaster problem, he skips to, “And therefore, because there’s no obvious relationship between the outer loss function and learned values, and because the space of possibilities is so large, almost all of it has no value, like paperclips.”
Agreed. It’s particularly annoying because, IMO, there is a strong candidate for “obvious relationship between the outer loss function and learned values”: learned values reflect the distribution over past computations that achieved high reward on the various shallow proxies of the outer loss function that the model encountered during training.
In fact, large language models arguably implement social instincts with more adroitness than many humans possess.
Large language models implement social behavior as expressed in text. I don’t want to call that social “instincts”, because the implementation, and out-of-distribution behavior, is surely going to be very different.
But a smile maximizer would have less diverse values than a human? It only cares about smiles, after all. When you say “smile maximizer”, is that shorthand for a system with a broad distribution over different values, one of which happens to be smile maximization?
A future with humans and smile-maximizers is more diverse than a future with just humans. (But, yes, “smile maximizer” here is our standard probably-unrealistic stock example standing in for inner alignment failures in general.)
That’s the source of my intuition that broader distributions over values are actually safer: it makes you less likely to miss something important.
Trying again: the reason I don’t want to call that “safety” is because, even if you’re less likely to completely miss something important, you’re more likely to accidentally incorporate something you actively don’t want.
If we start out with a system where I press the reward button when the AI makes me happy, and it scales and generalizes into a diverse coalition of a smile-maximizer, and a number-of-times-the-human-presses-the-reward-button-maximizer, and an amount-of-dopamine-in-the-human’s-brain-maximizer, plus a dozen or a thousand other things … okay, I could maybe believe that some fraction of the cosmic endowment gets used in ways that I approve of.
But what if most of it is things like … copies of me with my hand wired up to hit the reward button ten times a second, my face frozen into a permanent grin, while drugged up on a substance that increases serotonin levels, which correlated with “happiness” in the training environment, but when optimized subjectively amounts to an unpleasant form of manic insanity? That is, what if some parts of “diverse” are Actually Bad? I would really rather not roll the dice on this, if I had the choice!
For AIs, we’re currently interested in the values that arise in a single AI (specifically, the first AI capable of a hard takeoff), so single humans are the more appropriate reference class.
I’m sorry, but I don’t understand why looking at single AIs makes single humans the more appropriate reference class.
I’m drawing an analogy between AI training and human learning. I don’t think the process of training an AI via reinforcement learning is as different from human learning as many assume.
Most humans would (and do) seek power and resources in a way that is bad for other systems that happen to be in the way (e.g., rainforests). When we colloquially talk about AIs “destroying the world” by default, it’s a very self-centered summary: the world isn’t actually “destroyed”, just radically transformed in a way that doesn’t end with any of the existing humans being alive, much like how our civilization transforms the Earth in ways that cut down existing forests.
You might reply: but wild nature still exists; we don’t cut down all the forests! True, but an important question here is to what extent is that due to “actual” environmentalist/conservationist preferences in humans, and to what extent is it just that we “didn’t get around to it yet” at our current capability levels?
In today’s world, people who care about forest animals, and people who enjoy the experience of being in a forest, both have an interest in protecting forests. In the limit of arbitrarily advanced technology, this is less obvious: it’s probably more efficient to turn everything into an optimal computing substrate, and just simulate happy forest animals for the animal lovers and optimal forest scenery for the scenery-lovers. Any fine details of the original forest that the humans don’t care about (e.g., the internals of plants) would be lost.
This could still be good news, if it turns out to be easy to hit upon the AI analogue of animal-lovers (because something like “raise the utility of existing agents” is a natural abstraction that’s easy to learn?), but “existing humans would not destroy the world” seems far too pat. (We did! We’re doing it!)
I was specifically talking about the preferences of an individual human. The behavior of the economic systems that derive from the actions of many humans need not be aligned with the preferences of any component part of said systems. For AIs, we’re currently interested in the values that arise in a single AIs (specifically, the first AI capable of a hard takeoff), so single humans are the more appropriate reference class.
In fact, “radically transformed in a way that doesn’t end with any of the existing humans being alive” is what I meant by “destroyed”. That’s the thing that very few current humans would do, given sufficient power. That’s the thing that we’re concerned that future AIs might do, given sufficient power. You might have a different definition of the word “destroyed”, but I’m not using that definition.
I believe that there are plenty of people who would destroy the world. I do know at least one personally. I don’t know very many people to the extent that I could even hazard a guess as to whether they actually would or not, so either I am very fortunate (!) to know one of this tiny number, or there are at least millions of them and possibly hundreds of millions.
I am pretty certain that most humans would destroy the world if there was any conflict between that and any of their strongest values. The world persists only because there are no gods. The most powerful people to ever have existed have been powerful only because of the power granted to them by other humans. Remove that limitation and grant absolute power and superhuman intelligence along with capacity for further self-modification to a single person, and I give far better than even odds that what results is utterly catastrophic.
Let’s suppose there are ~300 million people who’d use their unlimited power to destroy the world (I think the true number is far smaller). That would mean > 95% of people wouldn’t do so. Suppose there were an alignment scheme that we’d tested billions of times on human-level AGIs, and > 95% of the time, it resulted in values compatible with humanity’s continued survival. I think that would be a pretty promising scheme.
If there were a process that predictably resulted in me having values strongly contrary to those I currently posses, I wouldn’t do it. The vast majority of people won’t take pills that turn them into murderers. For the same reason, an aligned AI at slightly superhuman capabilities levels won’t self modify without first becoming confidant that its self modification will preserve its values. Most likely, it would instead develop better alignment tech than we used to create said AI and create a more powerful aligned successor.
I think that a 95% success rate in not destroying the human world would also be fantastic, though I note that there are plenty more potential totalitarian hellscapes that some people would apparently rate even worse than extinction.
Note that I’m not saying that they would deliberately destroy the world for shits and giggles, just that if the rest of the human world was any impediment to anything they valued more, then its destruction would just be a side effect of what had to be done.
I also don’t have any illusion that a superintelligent agent will be infallible. The laws of the universe are not kind, and great power brings the opportunity for causing great disasters. I fully expect that any super-civilizational entity of any level of intelligence could very well destroy the human world by mistake.
Great, we’re on the same page.
I think I’m expressing skepticism that inner-misaligned adaptations in simple learning algorithms are enough to license using current humans as a reference class quite this casually?
The “traditional” Yudkowskian position says, “Just think of AI as something that computes plans that achieve outcomes; logically, a paperclip maximizer is going to eat you and use your atoms to make paperclips.” I read you as saying that AIs trained using anything like current-day machine learning techniques aren’t going to be pure consequentialists like that; they’ll have a mess of inner-misaligned “adaptations” and “instincts”, like us. I agree that this is plausible, but I think it suggests “AI will be like another evolved species” rather than “AI will be like humans” as our best current-world analogy, and the logic of “different preferences + more power = genocide” still seems likely to apply across a gap that large (even if it’s smaller than the gap to a pure consequentialist)?
This was close to my initial assumption as well. I’ve since spent a lot of time thinking about the dynamics that arise from inner alignment failures in a human-like learning system, essentially trying to apply microeconomics to the internal “economy” of optimization demons that would result from an inner alignment failure. You can see this comment for some preliminary thoughts along these lines. A startling fraction of our deepest morality-related intuitions seem to derive pretty naturally / robustly from the multi-agent incentives associated with an inner alignment failure.
Moreover, I think that there may be a pretty straightforward relationship between a learning system’s reward function and the actual values it develops: values are self-perpetuating, context-dependent strategies that obtained high reward during training. If you want to ensure a learning system develops a given value, it may simply be enough to ensure that the system is rewarded for implementing the associated strategy during training. To get an AI that wants to help humans, just ensure the AI is rewarded for helping humans during training.
To what extent do you expect this to generalize “correctly” outside of the training environment?
In your linked comment, you mention humans being averse to wireheading, but I think that’s only sort-of true: many people who successfully avoid trying heroin because they don’t want to become addicts still end up abusing a lot of other evolutionarily-novel superstimuli, like candy, pornography, and video games.
That makes me think inner-misalignment is still going to be a problem when you scale to superintelligence: maybe we evolve an AI “species” that’s genuinely helpful to us in the roughly human-level regime (where its notion of helping and our notion of being-helped, coincide very well), but when the AIs become more powerful than us, they mostly discard the original humans in favor of optimized AI-”helping”-”human” superstimuli.
I guess I could imagine this being an okay future if we happened to get lucky about how robust the generalization turned out to be—maybe the optimized AI-”helping”-”human” superstimuli actually are living good transhuman lives, rather than being a nonsentient “sex toy” that happens to be formed in our image? But I’d really rather not bet the universe on this (if I had the choice not to bet).
Do you know if there’s any research relevant to whether “degree of vulnerability to superstimuli” is correlated with intelligence in humans?
One aspect of inner alignment failures that I think is key to safe generalization is that values tend to multiply. E.g., the human reward system is an inner alignment failure wrt evolution’s single “value”. Human values are inner alignment failures wrt the reward system. Each step we’ve seen involves a significant increase in the breadth / diversity of values (admittedly, we’ve only seen two steps, but IMO it also makes sense that the process of inner alignment failure is oriented towards value diversification).
If even a relatively small fraction of the AI’s values orient towards actually helping humans, I think that’s enough to avert the worst possible futures. From that point, it becomes a matter of ensuring that values are able to perpetuate themselves robustly (currently a major focus of our work on this perspective; prospects seem surprisingly good, but far from certain).
I actually think it’s very likely that such superstimuli would be sentient. Humans are sentient. If you look at non-sentient humans (asleep, sleepwalking, in trance states, under some anesthetic drugs, etc.), they typically behave quite differently from normal humans.
Yeah, in the training environment. But, as you know, the reason people think inner-misalignment is a problem is precisely because capability gains can unlock exotic new out-of-distribution possibilities that don’t have the same properties.
Boring, old example (skip this paragraph if it’s too boring): humans evolved to value sweetness as an indicator of precious calories, and then we invented aspartame, which is much sweeter for much fewer calories. Someone in the past who reasoned, “If you look at sweet foods, they have a lot of calories; that’ll probably be true in the future”, would have been meaningfully wrong. (We still use actual sugar most of the time, but I think this is a lot like why we still have rainforests: in the limit of arbitrary capabilities, we don’t care about any of the details of “original” sugar except what it tastes like to us.)
Better, more topical example: human artists who create beautiful illustrations on demand experience a certain pride in craftsmanship. Does DALL-E? Notwithstanding whether “it may be that today’s large neural networks are slightly conscious”, I’m going to guess No, there’s nothing in text-to-image models remotely like a human artist’s pride; we figured out how to get the same end result (beautiful art on demand) in an alien, inhuman way that’s not very much like a human artist internally. Someone in the past who reasoned, “The creators of beautiful art will take pride in their craft,” would be wrong.
“Increase in diversity” and “safe generalization” seem like really different things to me? What if some of the new, diverse values are actually bad from our perspective? (Something like, being forced to smile might make you actually unhappy despite the outward appearance of your face, but a human-smile-maximizer doesn’t care about that, and this future is more diverse than the present because the present doesn’t have any smile-maximizers.)
Basically, some of your comments make me worry that you’re suffering from a bit of anthropomorphic optimism?
At the same time, however, I think this line of research is very interesting and I’m excited to see where you go with it! Yudkowsky tends to do this lame thing where after explaining the inner-alignment/context-disaster problem, he skips to, “And therefore, because there’s no obvious relationship between the outer loss function and learned values, and because the space of possibilities is so large, almost all of it has no value, like paperclips.” I think there’s a lot of missing argumentation there, and discovering the correct arguments could change the conclusion and our decisions a lot! (In the standard metaphor, we’re not really in the position of “evolution” with respect to AI so much as we are the environment of evolutionary adaptedness.) It’s just, we need to be careful to be asking, “Okay, what actually happens with inner alignment failures; what’s the actual outcome specifically?” without trying to “force” that search into finding reassuring fake reasons why the future is actually OK.
Ironically, one of my medium-sized issues with mainline alignment thinking is that it seems to underweight the evidence we get from observing humans and human values. The human brain is, by far, the most general and agentic learning system in current existence, and we have ~7 billion examples of human value learning to observe. When you have limited evidence about a domain, what little evidence you do have should strongly inform your intuitions, so this data should strongly inform our expectations of how other highly general and agentic learning systems behave.
In fact, our observations of humans should inform our expectations of AGIs much more strongly than the above argument implies because we are going to train those AGIs on data generated by humans. It’s well known in deep learning that training data are usually more important than details of the learning process or architecture.
I think alignment thinking has an inappropriately strong bias against anchoring expectations to our observations of humans. There’s an assumption that the human learning algorithm is in some way “unnatural” among the space of general and effective learning algorithms, and that we therefore can’t draw inferences about AGIs based on our observations of humans. See, e.g., Eliezer Yudkowsky’s post My Childhood Role Model.
Yudkowsky notes that a learning algorithm hyper-specialized to the ancestral environment would not generalize well to thinking about non-ancestral domains like physics. This is absolutely correct, and it represents a significant misprediction of any view assigning a high degree of specialization to the brain’s learning algorithm.
In fact, large language models arguably implement social instincts with more adroitness than many humans possess. However, original physics research in the style of Einstein remains well out of reach. This is exactly the opposite of what you should predict if you believe that evolution hard coded most of the brain to “cleverly argue that they deserve to receive a larger share of the meat”.
Yudkowsky brings up multiplication as an example of a task that humans perform poorly at, supposedly because brains specialized to the ancestral environment had no need for such capabilities. And yet, GPT-3 is also terrible at multiplication (even after accounting for the BPE issue), and no part of its architecture or training procedure is at all specialized for the human ancestral environment.
But a smile maximizer would have less diverse values than a human? It only cares about smiles, after all. When you say “smile maximizer”, is that shorthand for a system with a broad distribution over different values, one of which happens to be smile maximization? That’s closer to how I think of things, with the system’s high level behavior arising as a sort of negotiated agreement between its various values.
IMO, systems with broader distributions over values are more likely to assign at least some weight to things like “make people actually happy” and to other values that we don’t even know we should have included. In that case, the “make people actually happy” value and the “smile maximization” value can cooperate and make people smile by being happy (and also cooperate with the various other values the system develops). That’s the source of my intuition that broader distributions over values are actually safer: they make you less likely to miss something important.
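Here’s a deliberately cartoonish sketch of that intuition (every name and weight is hypothetical): an agent picks the action that best satisfies a weighted mixture of its values, and even a modest weight on “actual happiness” is enough to steer a broad mixture away from pure smile-maximization.

```python
# Toy sketch: behavior as a negotiated agreement between weighted values.
# Each action scores differently against each value (hypothetical numbers).
ACTIONS = {
    "force_smiles": {"smiles": 1.0, "actual_happiness": -1.0},
    "make_happy":   {"smiles": 0.8, "actual_happiness": 1.0},
}

def best_action(value_weights):
    """Pick the action maximizing the weighted sum over the agent's values."""
    return max(ACTIONS, key=lambda a: sum(
        value_weights.get(v, 0.0) * score for v, score in ACTIONS[a].items()))

narrow = {"smiles": 1.0}                          # a pure smile-maximizer
broad = {"smiles": 1.0, "actual_happiness": 0.5}  # broader value distribution

assert best_action(narrow) == "force_smiles"
assert best_action(broad) == "make_happy"
```

The point isn’t the arithmetic; it’s that a broad mixture only needs *some* weight on the right value for the values to “cooperate” toward an outcome the narrow optimizer misses.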
More generally, I think that a lot of the alignment intuition that “values are fragile” actually comes from a pretty simple type error. Consider:
The computation a system executes depends on its inputs. If you have some distribution over possible inputs, that translates to having a distribution over possible computations.
“Values” is just a label we apply to particular components of a system’s computations.
If a system has a situation-dependent distribution over possible computations, and values are implemented by those computations, then the system also has a situation-dependent distribution over possible values.
However, people can only consciously instantiate a small subset of discrete values at any given time. There thus appears to be a contrast between “the values we can imagine” and “the values we actually have”. Trying to list out a discrete set of “true human values” roughly corresponds to trying to represent a continuous distribution with a small set of discrete samples from that distribution (this is the type error in question). It doesn’t help that the distribution over values is situation-dependent, so any sampling of their values a human performs in one situation may not transfer to the samples they’d take in another situation.
Given the above, it should be no surprise that our values feel “fragile” when we introspect on them.
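The sampling picture above can be made concrete with a toy sketch (the contexts, value names, and weights are all hypothetical): introspection draws a few discrete samples from a context-dependent distribution over values, so small samples taken in different contexts needn’t agree, even for the same agent.

```python
import random

# Toy sketch of the "type error": values as a context-dependent distribution,
# introspection as a small number of discrete samples from it.
VALUE_WEIGHTS = {
    "calm_evening":         {"comfort": 0.6, "aesthetics": 0.3, "fairness": 0.1},
    "witnessing_injustice": {"fairness": 0.7, "aesthetics": 0.2, "comfort": 0.1},
}

def introspect(context, n_samples, rng):
    """Sample a handful of 'consciously accessible' values in a context."""
    weights = VALUE_WEIGHTS[context]
    names = list(weights)
    return {rng.choices(names, weights=[weights[n] for n in names])[0]
            for _ in range(n_samples)}

rng = random.Random(0)
# A few samples in one context need not match a few samples in another,
# even though both come from the same underlying agent.
print(introspect("calm_evening", 3, rng))
print(introspect("witnessing_injustice", 3, rng))
```

Listing the output of one such introspection and calling it “the agent’s true values” confuses a sample for the distribution, which is exactly why the values feel fragile under that framing.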
Preempting a possible confusion: the above treats a “value” and “the computation that implements that value” interchangeably. If you’re thinking of a “value” as something like a principal component of an agent’s utility function, somehow kept separate from the system that actually implements those values, then this might seem counterintuitive.
Under this framing, questions like whether the physical rainforests are destroyed, or whether other things we might value are preserved, are mainly about ensuring that a broad distribution of worthwhile values can perpetuate itself across time and influence the world to at least some degree. “Preserving X”, for any value of X, is about ensuring that the system has at least some values oriented towards preserving X, that those values can persist over time, and that those values can actually ensure that X is preserved. (And the broader the values, the more different Xs we can preserve.)
I think the prospects for achieving those three things are pretty good, though I don’t think I’m ready to write up my full case for that belief.
(I do admit that it’s possible to have a system that ends up pursuing a simple / “dumb” goal, such as maximizing paperclips, to the exclusion of all else. That can happen when the system’s distribution over possible values places so much weight on paperclip-adjacent values that they can always override any other values. This is another reason I’m in favor of broad distributions over values.)
Agreed. It’s particularly annoying because, IMO, there is a strong candidate for an “obvious relationship between the outer loss function and learned values”: learned values reflect the distribution over past computations that achieved high reward on the various shallow proxies of the outer loss function that the model encountered during training.
(Thanks for your patience.)
Large language models implement social behavior as expressed in text. I don’t want to call that social “instincts”, because the implementation, and out-of-distribution behavior, is surely going to be very different.
A future with humans and smile-maximizers is more diverse than a future with just humans. (But, yes, “smile maximizer” here is our standard probably-unrealistic stock example standing in for inner alignment failures in general.)
Trying again: the reason I don’t want to call that “safety” is because, even if you’re less likely to completely miss something important, you’re more likely to accidentally incorporate something you actively don’t want.
If we start out with a system where I press the reward button when the AI makes me happy, and it scales and generalizes into a diverse coalition of a smile-maximizer, and a number-of-times-the-human-presses-the-reward-button-maximizer, and an amount-of-dopamine-in-the-human’s-brain-maximizer, plus a dozen or a thousand other things … okay, I could maybe believe that some fraction of the cosmic endowment gets used in ways that I approve of.
But what if most of it is things like … copies of me with my hand wired up to hit the reward button ten times a second, my face frozen into a permanent grin, while drugged up on a substance that increases serotonin levels, which correlated with “happiness” in the training environment, but when optimized subjectively amounts to an unpleasant form of manic insanity? That is, what if some parts of “diverse” are Actually Bad? I would really rather not roll the dice on this, if I had the choice!
I’m sorry, but I don’t understand why looking at single AIs makes single humans the more appropriate reference class.
I’m drawing an analogy between AI training and human learning. I don’t think the process of training an AI via reinforcement learning is as different from human learning as many assume.