I mostly agree with the reasoning here; thank you to Eliezer for posting it and explaining it clearly. It’s good to have all these reasons here in one place.
The one area I partly disagree with is Section B.1. As I understand it, the main point of B.1 is that we can’t guard against all of the problems that will crop up as AI grows more intelligent, because we can’t foresee all of those problems, because most of them will be “out-of-distribution,” i.e., not the kinds of problems where we have reasonable training data. A superintelligent AI will do strange things that wouldn’t have occurred to us, precisely because it’s smarter than we are, and some of those things will be dangerous enough to wipe out all human life.
I think this somewhat overstates the problem. If we tell an AI not to invent nanotechnology, not to send anything to protein labs, not to hack into all of the world’s computers, not to design weird new quantum particles, not to do 100 of the other most dangerous and weirdest things we can think of, and then ask it to generalize and learn not to do things of that sort and build avoidance of catastrophic danger as a category into its utility function...
And then we test whether the AI is actually doing these things and successfully using something like the human category of “catastrophe” when the AI is only slightly smarter than humans...
And then learn from those tests and honestly look at the failures and improve the AI’s catastrophe-avoidance skills based on what we learn...
Then the chances that that AI won’t immediately destroy the world seem to me to be much, much larger than 0.1%. They’re still low, which is bad, but they’re not laughably insignificant, either, because if you make an honest, thoughtful, sustained effort to constrain the preferences of your successors, then often you at least partially succeed.
If natural selection had feelings, it might not be maximally happy with the way humans are behaving in the wake of Cro-Magnon optimization...but it probably wouldn’t call it a disaster, either. Despite the existence of contraception, there sure are a whole lot more Cro-Magnons than there ever were Neanderthals, and the population is still going up every year.
Similarly, training an AI to act responsibly isn’t going to get us a reliably safe AI, but if whoever launches the first super-intelligent AI puts enough effort into that kind of training, then I don’t see any reason why we shouldn’t expect at least a 50% chance of a million or more survivors. I’m much more worried about large, powerful organizations that “vocally disdain all talk of AGI safety” than I am about the possibility that AGI safety research is inherently futile. It’s inherently imperfect in that there’s no apparent path to guaranteeing the friendliness of superintelligence...but that’s not quite the same thing as saying that we shouldn’t expect to be able to increase the probability that superintelligence is at least marginally friendly.
Out of a population of 8 billion humans, in a world that has known about Darwin for generations, very nearly zero are trying to directly manufacture large numbers of copies of their genomes—there is almost no creative generalization towards ‘make more copies of my genome’ as a goal in its own right.
Meanwhile, there is some creativity going into the proxy goal ‘have more babies’, and even more creativity going into the second-order proxy goal ‘have more sex’. But the net effect is that the world is becoming wealthier, and the wealthiest places are reliably choosing static or declining population sizes.
And if you wind the clock forward, you likely see humans transitioning into brain emulations (and then self-modifying a bunch), leaving DNA self-replicators behind entirely. (Or you see humans replacing themselves with AGIs. But it would be question-begging to cite this particular prediction here, though it is yet another way humans are catastrophically straying from what human natural selection ‘wanted’.)
Right, I’m not claiming that AGI will do anything like straightforwardly maximize human utility. I’m claiming that if we work hard enough at teaching it to avoid disaster, it has a significant chance of avoiding disaster.
The fact that nobody is artificially mass-producing their genes is not a disaster from Darwin’s point of view; Darwin is vaguely satisfied that instead of a million humans there are now 7 billion humans. If the population stabilizes at 11 billion, that is also not a Darwinian disaster. If the population spreads across the galaxy, mostly in the form of emulations and AIs, but with even 0.001% of sentient beings maintaining some human DNA as a pet or a bit of nostalgia, that’s still way more copies of our DNA than the Neanderthals were ever going to get.
There are probably some really convincing analogies or intuition pumps somewhere that show that values are likely to be obliterated after a jump in intelligence, but I really don’t think evolution/contraception is one of those analogies.
As stated, I think Eliezer and I, and nearly everyone else, would agree with this.
?? Why would human natural selection be satisfied with 7 billion but not satisfied with a million? Seems like you could equally say ‘natural selection is satisfied with a million, since at least a million is higher than a thousand’. Or ‘natural selection is satisfied with a hundred, since at least a hundred is higher than fifty’.
I understand the idea of extracting from a population’s process of natural selection a pseudo-goal, ‘maximize inclusive genetic fitness’; I don’t understand the idea of adding that natural selection has some threshold where it ‘feels’ ‘satisfied’.
Sure, the metaphor is strained because natural selection doesn’t have feelings, so it’s never going to feel satisfied, because it’s never going to feel anything. For whatever it’s worth, I didn’t pick that metaphor; Eliezer mentions contraception in his original post.
As I understand it, the point of bringing up contraception is to show that when you move from one level of intelligence to another, much higher level of intelligence, then the more intelligent agent can wind up optimizing for values that would be anathema to the less intelligent agents, even if the less intelligent agents have done everything they can to pass along their values. My objection to this illustration is that I don’t think anyone’s demonstrated that human goals could plausibly be described as “anathema” to natural selection. Overall, humans are pursuing a set of goals that are relatively well-aligned with natural selection’s pseudo-goals.
Why do you think the goal of evolution is “more copies of genome” rather than “more babies”? To the extent that evolution can be said to have a goal, I think “more babies” is closer; e.g., imagine a mutation that caused uncontrolled DNA replication within a cell. That would lead to lots of copies of its genome but not more reproductive fitness. (Really, I guess this means that you need to specify which evolution you’re talking about; I think the evolution for healthy adult humans has “babies who grow to adulthood” as its goal.)
W.r.t. declining population sizes, I think it’s likely we would return to Malthusianism after a few more generations of genetic/cultural selection under modern conditions, although, as you say, the singularity is going to come before that can happen.
Yeah, but the population is still pretty large and could become much larger if we become intergalactic. And possibly this is more likely than if we were at the Malthusian limits.
I had the exact same thought. My guess would be that Eliezer might say that, since the AI is maximising, if the generalisation function misses even one action of this sort as something that we should exclude, then we’re screwed.
Sure, I agree! If we miss even one such action, we’re screwed. My point is that if people put enough skill and effort into trying to catch all such actions, then there is a significant chance that they’ll catch literally all the actions that are (1) world-ending and that (2) the AI actually wants to try.
There’s also a significant chance we won’t, which is quite bad and very alarming, hence people should work on AI safety.
Hmm… It seems much, much harder to catch every single one than to catch 99%.
One of my assumptions is that it’s possible to design a “satisficing” engine—an algorithm that generates candidate proposals for a fixed number of cycles, and then, assuming at least one proposal with estimated utility greater than X has been generated within that amount of time, selects one of the qualifying proposals at random. If there are no qualifying candidates, the AI takes no action.
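To make the idea concrete, here is a minimal sketch of the satisficing loop described above. It is only an illustration under stated assumptions: `propose` and `estimate_utility` are hypothetical placeholders for whatever plan generator and utility estimator the AI actually has, and `threshold` plays the role of X.

```python
import random
from typing import Callable, Optional, TypeVar

Plan = TypeVar("Plan")


def satisfice(
    propose: Callable[[], Plan],                # hypothetical: returns one candidate plan
    estimate_utility: Callable[[Plan], float],  # hypothetical: scores a candidate plan
    threshold: float,                           # the "X" in the description above
    cycles: int,                                # fixed number of proposal cycles
    rng: Optional[random.Random] = None,
) -> Optional[Plan]:
    """Pick a random plan whose estimated utility exceeds the threshold.

    Returns None (meaning "take no action") if no candidate qualifies.
    """
    if rng is None:
        rng = random.Random()
    # Generate candidates for a fixed budget, with no search for the best one.
    candidates = [propose() for _ in range(cycles)]
    qualifying = [p for p in candidates if estimate_utility(p) > threshold]
    if not qualifying:
        return None
    # Choose uniformly at random among qualifying plans, not the maximum.
    return rng.choice(qualifying)
```

The only difference from a straightforward optimizer is the last step: an optimizer would return the highest-scoring candidate, whereas this sketch samples uniformly from everything above the bar.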
If you have a straightforward optimizer that always returns the action with the highest expected utility, then, yeah, you only have to miss one “cheat” that improves “official” utility at the expense of murdering everyone everywhere and then we all die. But if you have a satisficer, then as long as some of the qualifying plans don’t kill everyone, there’s a reasonable chance that the AI will pick one of those plans. Even if you forget to explicitly penalize one of the pathways to disaster, there’s no special reason why that one pathway would show up in a large majority of the AI’s candidate plans.
There is a special reason, and it’s called “instrumental convergence.” Satisficers tend to seek power.
I suspect we’re talking about two different things.
If you just naively program a super-intelligent AI to satisfice a goal, then, sure, most of the candidate pathways to satisfice will involve accruing a lot of some type of power, because power is useful for achieving goals. That’s a valid point, and it’s important to understand that merely switching from optimizers to satisficers won’t adequately protect us against overly ambitious AIs.
However, that doesn’t mean that it’s futile to explicitly penalize most (but not literally all) of the paths that the AI could take to accumulate too much power. Suppose you adequately penalize all of the tactics that would have catastrophic side effects except for, oh, I don’t know, cornering the world market on all types of grain, because you forgot to think of that one particular path to power. Would most of the candidate plans that the AI submits for human approval then turn out to involve secretly cornering the grain market?
I don’t see why they would. All else being equal, sure, the most powerful tactic available is going to be somewhat more attractive than other, less powerful tactics. But how much more attractive? Would an AI generate 1,000x more plans that involve one specific globally disruptive tactic like cornering the grain market vs. all of the various tactics that are only locally disruptive, like publishing propaganda or hiring away competing researchers or hacking into competing research labs or interrupting the flow of electricity to those labs? 10x more such plans? 2x more such plans? I don’t think that’s the kind of estimate you can make just by gesturing at basic principles of game theory; you’d need some concrete domain knowledge about the AI’s specific planning algorithms.
If the truly dangerous plans are only 10x more common on the initial brainstorming list, then we can probably make sure those plans aren’t chosen by filtering for plans that the AI rates as safe and transparent. We can use 3 different AIs on airgapped equipment—one that only “cares” about evaluating whether a plan is safe (i.e., would the plan cause unacceptably large changes if implemented?), one that only “cares” about evaluating whether a plan is transparent (i.e., can the plan be explained to humans so that the humans will have an accurate understanding of what actions will be taken as part of the plan?), and one that generates plans that satisfice for a substantive goal (in this case, shutting down opposing AI research).
Of the plans that are approved by all 3 AIs, most of them probably won’t involve accumulating excessive amounts of power, because usually if you accumulate too much power, then that is (a) unsafe, and (b) likely to involve lying to ‘your’ human programmers.
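As a rough sketch of how the three roles would fit together (not a claim about how any real system works), the selection step might look like the following. The names `generate`, `is_safe`, and `is_transparent` are hypothetical stand-ins for the three airgapped AIs, and plans are represented as plain strings purely for illustration.

```python
import random
from typing import Callable, List, Optional

# Hypothetical interfaces for the three separate AIs.
Generator = Callable[[int], List[str]]  # proposes n plans that satisfice the goal
Evaluator = Callable[[str], bool]       # approves or rejects a single plan


def choose_plan(
    generate: Generator,
    is_safe: Evaluator,
    is_transparent: Evaluator,
    n: int,
    rng: random.Random,
) -> Optional[str]:
    """Return a random plan approved by both evaluators, or None if none pass."""
    candidates = generate(n)
    # A plan survives only if the safety AND transparency evaluators approve it.
    approved = [p for p in candidates if is_safe(p) and is_transparent(p)]
    if not approved:
        return None  # nothing passed both filters: take no action
    return rng.choice(approved)
```

Requiring approval from both evaluators means an excessively powerful plan has to slip past two independent checks, which is the point of keeping the evaluators separate from the plan generator.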
If powerful plans are 10x more common on the initial brainstorming list, then you need filters for safety and/or transparency that are at least 90% effective at catching unacceptably powerful plans in order to get back up to a 50% chance of humanity surviving. If powerful plans are more than 10x more common, then you need a set of filters that’s more reliable than 90%. I don’t think we have any idea what those numbers are yet, but I do think it’s worth trying to reduce how common it is for excessively powerful plans to show up on the initial brainstorming list, and I think we can do that by training AIs to recognize dangerously disruptive plans and to try to avoid those types of plans. It’s better to at least try to get AIs to engage with the concept of “this plan is too disruptive” than to throw up our hands and say, “Oh, power is an attractor in game theory space, so there’s no possible way to get brilliant AIs that don’t seize infinite power.”
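The arithmetic behind the 90% figure can be checked with a few lines of code. This is a back-of-the-envelope sketch with two simplifying assumptions that are mine, not established facts: the filters never reject an acceptable plan, and the final plan is drawn uniformly at random from whatever survives filtering. The function and parameter names are made up for the illustration.

```python
def survival_probability(danger_ratio: float, catch_rate: float) -> float:
    """Chance that a randomly chosen surviving plan is not dangerous.

    danger_ratio: dangerous plans per acceptable plan on the initial list
                  (10.0 means "10x more common").
    catch_rate:   fraction of dangerous plans that the filters remove.
    Assumes acceptable plans always pass the filters.
    """
    surviving_dangerous = danger_ratio * (1.0 - catch_rate)
    # One acceptable plan for every `surviving_dangerous` dangerous ones.
    return 1.0 / (1.0 + surviving_dangerous)


def required_catch_rate(danger_ratio: float, target: float = 0.5) -> float:
    """Smallest catch rate that reaches the target survival probability."""
    # Solve 1 / (1 + r * (1 - c)) >= target for c.
    return max(0.0, 1.0 - (1.0 - target) / (target * danger_ratio))


# With 10 dangerous plans per acceptable plan, a 90% catch rate leaves
# roughly one dangerous plan per acceptable plan, i.e. about even odds.
print(survival_probability(10.0, 0.9))
print(required_catch_rate(10.0))   # catch rate needed at 10x
print(required_catch_rate(100.0))  # catch rate needed at 100x
```

Under these assumptions, the required catch rate climbs quickly as the ratio grows, which is why reducing how often dangerous plans show up in the first place matters as much as improving the filters.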
You mean quantilization? Oh yeah, I forgot about that. Good point.