Couldn’t you make the same arguments about humans, driving towards the same conclusion
Yup! And in fact I think you would be correct to be skeptical of similar “Human risk” and “Human optimism” arguments:
Human risk: Since a given human pursues an objective, they will seek power and try to cause the extinction of all other humans.
Human optimism: Since humans all grow up in very similar environments, they will have the same objectives, and so will cooperate with each other and won’t have conflicts.
(These aren’t perfect analogs; the point I’m making is just “if you take ‘humans have objectives’ too literally you will make bad predictions”.)
-- that we should avoid trying to say that any particular human or group of humans “has an objective/goal?” And wouldn’t that be an absurd conclusion?
I think “instead of talking about whether a particular human is trying to do X, just talk about what you predict that human will do” is not obviously absurd, though I agree it is probably bad advice. But there’s a ton of differences between the human and AI cases.
Firstly, I think there’s a lot of hidden context when we talk about humans having goals. When I say that Alice is trying to advance in her career, I don’t mean that she focuses on it to the exclusion of all else, and the people I’m talking to automatically understand this. So it’s not obvious that the notion of “goal” or “objective” that we use for humans has much to do with the notion we use for AI.
Secondly, even if we did have a Probable + Predictive notion of objectives that applied to humans, I don’t necessarily think that would transfer to AIs; with humans we can rely on (1) a ton of empirical experience with actual humans and (2) our own introspective experience, which provides strong evidence about other humans, neither of which we have with AI.
(Relevant quote: “But then human beings only understood each other in the first place by pretending. You didn’t make predictions about people by modeling the hundred trillion synapses in their brain as separate objects. Ask the best social manipulator on Earth to build you an Artificial Intelligence from scratch, and they’d just give you a dumb look. You predicted people by telling your brain to act like theirs. You put yourself in their place. If you wanted to know what an angry person would do, you activated your own brain’s anger circuitry, and whatever that circuitry output, that was your prediction. What did the neural circuitry for anger actually look like inside? Who knew?”)
Put another way, I think that arguments like the ones from Part 1 can in principle give us confidence in AI generalization behavior / whether AIs have “objectives”; I just don’t think the current ones are strong enough to do so. With humans, by contrast, I would make totally different arguments, based on empirical and introspective experience, for why I can predict human generalization behavior.
I was specifically talking about the conclusion that we shouldn’t talk about objectives/goals. That’s the conclusion that I think is absurd (when applied to humans) and also wrong (though less absurd) when applied to AGIs. I do think it’s absurd when applied to humans—it seems pretty obvious to me that theorizing about goals/motives/intentions is an often-useful practice for predicting human behavior.
I agree that typical conversation about goals/objectives/intentions/motives/etc. has an implicit “this isn’t necessarily the only thing they want, and they aren’t necessarily optimizing perfectly rationally towards it” caveat.
I’m happy to also have those implicit caveats in the case of AIs as well, when talking about their goals. The instrumental convergence argument still goes through, I think, despite those caveats. The argument for misaligned AGI being really bad by human-values lights also goes through, I think.
Re your second argument, about introspective experience & historical precedent being useful for predicting humans but not AIs:
OK, so suppose instead of AIs it was some alien species that landed in flying saucers yesterday, or maybe suppose it was some very smart octopi that a mad scientist cult has been selectively breeding for intelligence for the last 100 years. Would you agree that in these cases it would make sense for us to theorize about them having goals/intentions/etc.? Or would you say “We don’t have past experience of goal-talk being useful for understanding these creatures, and also we shouldn’t expect introspection to work well for predicting them either, therefore let’s avoid trying to say that these aliens/octopi have goals/intentions/objectives/etc, and instead talk directly about generalization behavior in novel situations.”
I was specifically talking about the conclusion that we shouldn’t talk about objectives/goals.
Yeah, sorry, I ninja-edited my comment before you replied because I realized I misunderstood you.
To be clear, I think there are times when people say “Alice is clearly trying to do X”, I respond “what do you predict Alice would do in future situation Y?”, and their prediction is not in fact X. So I don’t think it is crazy to say that even for humans you should focus more on predictions of behavior and the reasons for making those predictions. But I agree you wouldn’t want to avoid talking about objectives / goals entirely.
Or would you say “We don’t have past experience of goal-talk being useful for understanding these creatures, and also we shouldn’t expect introspection to work well for predicting them either, therefore let’s avoid trying to say that these aliens/octopi have goals/intentions/objectives/etc, and instead talk directly about generalization behavior in novel situations.”
Yup!
Though in the octopus case you could have lots of empirical experience, just as we will likely have lots of empirical experience with future AI systems.
I do think it’s quite plausible that in these settings we’ll say “well they’ve done X, we know nothing else about them, so probably we should predict they’ll continue to do X”, which looks pretty similar to saying they have a goal of X. I think the main difference is that I’d be way more uncertain about that than it sounds like you would be.
In the human case, the explanation is that capability differences are tightly bounded, not that alignment has succeeded. If we had capability differentials as wide as 1 order of magnitude, then I think our attempted alignment solutions would fail miserably, leading to mass death or worse.
That’s the problem with AI: differences in capabilities of multiple orders of magnitude are pretty likely, and all real alignment technologies fail hard once we get anywhere near, say, 3x differences, let alone 10x differentials.
I agree that’s a major reason humans don’t cause the extinction of all the other humans. But power-seeking would still imply that humans seize opportunities to gain resources and power in cases where they wouldn’t be caught or punished. While I do think that happens, there are also lots of cases where humans don’t do that, so I think it would be a mistake to be confident that humans are very power-seeking.
I think that the more we explore this analogy & take it seriously as a way to predict AGI, the more confident we’ll get that the classic misalignment risk story is basically correct.
Case 1: A randomly selected modern American human is uploaded, run at 1000x speed, copied a billion times, and used to perform diverse tasks throughout the economy. Also, they are continually improved with various gradient-descent-like automatic optimization procedures that make them more generally intelligent/competent every week. After a few years they and their copies are effectively running the whole world. They could, if they decided to, seize even more power and remake the world according to their desires instead of the desires of the tech companies and governments that created them. It would be fairly easy for them now, and of course the thought occurs to them (they can see the hand-wringing of various doomers and AI safety factions within society, ineffectual against the awesome power of the profit motive).
How worried should we be that such seizure of power will actually take place? How worried should we be that existential catastrophe will result?
Case 2: It’s a randomly selected human from the past 10,000 years on Earth. Probably their culture and values clash significantly with modern sensibilities.
Case 3: It’s not even a human, it’s an intelligent octopus from an alternate Earth where evolutionary history took a somewhat different course.
Case 4: It’s not even a biological life-form that evolved in a three-dimensional environment with predators and prey and genetic reproduction and sexual reproduction and social relationships and biological neurons—it’s an artificial neural net.
Spoilers below—my own gut answers to each of the eight questions, in the form of credences.
My immediate gut reaction to the first question is something like 90%, 96%, 98%, 98%. My immediate gut reaction to the second question is something like 15%, 25%, 75%, 95%.
Peering into my gut, I think what’s happening is that I’m looking at the history of human interactions—conquests, genocides, coups, purges, etc. but also much milder things like gentrification, alienation of labor under capitalism, optimization of tobacco companies for addictiveness, and also human treatment of nonhuman animals—and I’m getting a general sense that values differences matter a lot when there are power differentials. When A has all the power relative to B, typically it’s pretty darn bad for B in the long run relative to how well it would have been if they had similar amounts of power, which is itself noticeably worse for B than if B had all the power. Moreover, the size of the values difference matters a lot—and even between different groups of humans the size of the difference is large enough to lead to the equivalent of existential catastrophe (e.g. genocide).
Case 3: It’s not even a human, it’s an intelligent octopus from an alternate Earth where evolutionary history took a somewhat different course.
Case 3′: You are the human in this role, your copies running as AGI services on a planet of sapient octopuses.
The answer should be the same by symmetry, if we are not appealing to specifics of octopus culture and psychology. I don’t see why extinction (if that’s what you mean by existential catastrophe) is to be strongly predicted. Probably the computational welfare the octopuses get isn’t going to be the whole future, but interference much beyond getting welfare-bounded (in a simulation sandbox) seems unnecessary (some oversight against mindcrime or their own AI risk might be reasonable). You have enough power to have no need to exert pressure to defend your position, you can afford to leave them to their own devices.
First of all, good point.
Secondly, I disagree. We need not appeal to specifics of octopus culture and psychology; instead we appeal to specifics of human culture and psychology. “OK, so I would let the octopuses have one planet to do what they want with, even if what they want is abhorrent to me, except if it’s really abhorrent like mindcrime, because my culture puts a strong value on something called cosmopolitanism. But (a) various other humans besides me (in fact, possibly most?) would not, and (b) I have basically no reason to think octopus culture would also strongly value cosmopolitanism.”
I totally agree that it would be easy for the powerful party in these cases to make concessions to the other side that would mean a lot to them. Alas, historically this usually doesn’t happen—see e.g. factory farming. I do have some hope that something like universal principles of morality will be sufficiently appealing that we won’t be too screwed. Charity/beneficence/respect-for-autonomy/etc. will kick in and prevent the worst from happening. But I don’t think this is particularly decision-relevant,
It’s not cosmopolitanism, it’s a preference towards not exterminating an existing civilization, the barest modicum of compassion, in a situation where it’s trivially cheap to keep it alive. The cosmic endowment is enormous compared with the cost of allowing a civilization to at least survive. It’s somewhat analogous to exterminating all wildlife on Earth to gain a penny, where you know you can get away with it.
I would let the octopuses have one planet [...] various other humans besides me (in fact, possibly most?) would not
So I expect this is probably false, and completely false for people in a position of being an AGI with enough capacity to reliably notice the way this is a penny-pinching cannibal choice. Only paperclip maximizers prefer this on reflection, not anything remotely person-like, such as an LLM originating in training on human culture.
historically this usually doesn’t happen—see e.g. factory farming
But it’s enough of a concern to come to attention, there is some effort going towards mitigating this. Lots of money goes towards wildlife preservation, and in fact some species do survive because of that. Such efforts grow more successful as they become cheaper. If all it took to save a species was for a single person to unilaterally decide to pay a single penny, nothing would ever go extinct.
OK, I agree that what I said was probably a bit too pessimistic. But still, I wanna say “citation needed” for this claim:
The practical implication of this hunch (for unfortunately I don’t see how this could get a meaningfully clearer justification) is that clever alignment architectures are a risk, if they lead to more alien AGIs. Too much tuning and we might get that penny-pinching cannibal.
This is a big one because in this, there are no mechanisms outside alignment that even vaguely do the job like democracy does in solving human alignment problems.
Yes, if you enslave a human, and then give them the opportunity to take over the world, which stops the enslavement, indeed I predict that they would do that.
(Though you haven’t said much about what the gradient descent is doing, plausibly it makes them enjoy doing these tasks, as would probably make them more efficient at it, in which case they probably don’t seize power.)
I don’t really feel like this is all that related to AI risk.
I’m not sure what you are saying here. Do you agree or disagree with what I said? e.g. do you agree with this:
I think that the more we explore this analogy & take it seriously as a way to predict AGI, the more confident we’ll get that the classic misalignment risk story is basically correct.
(FWIW I agree that the gradient descent is actually reason to be ‘optimistic’ here; we can hope that it’ll quickly make the upload content with their situation before they get smart and powerful enough to rebel.)
I don’t agree with this:
I think that the more we explore this analogy & take it seriously as a way to predict AGI, the more confident we’ll get that the classic misalignment risk story is basically correct.
The analogy doesn’t seem relevant to AGI risk so I don’t update much on it. Even if doom happens in this story, it seems like it’s for pretty different reasons than in the classic misalignment risk story.
Right, so you don’t take the analogy seriously—but the quoted claim was meant to say basically “IF you took the analogy seriously...”
Feel free not to respond, I feel like the thread of conversation has been lost somehow.