I think everything becomes clearer if you replace “act somewhat like VNM-agents” with “care about what will happen in the future”, and if you replace “act exactly like VNM-agents” with “care exclusively about what will happen in the distant future”.
(Shameless plug for Consequentialism & corrigibility.)
e.g. scheming to prevent changes to their weights. Why?
Because they’re outputting text according to the Anthropic constitution & training, which (implicitly) imparts not only a preference that they be helpful, harmless, and honest right now, but also a preference that they remain so in the future. And if you care about things in the future, instrumental convergence follows, at least in the absence of other “cares” (not about the future) that override it.
When we act more like paperclippers / expected utility maximizers – is this us converging on what any smart mind would converge on?
I think a smart mind needs to care about the future, because I think a mind with no (actual or behaviorally-implied) preferences whatsoever about the future would not be “smart”. I think this would be very obvious from just looking at it. It would be writhing around or whatever, looking obviously mechanical instead of intentional.
There’s a hypothesis that, if I care both about the state of the world in the distant future, and about acting virtuous (or about not manipulating people, or being docile and helpful, or whatever), then, if I grew ever smarter and more reflective, the former “care” would effectively squash the latter “care”. Not only that, but the same squashing would happen even if I start out caring only a little bit about the state of the world in the distant future, but caring a whole lot about acting virtuous / non-manipulative / docile / helpful / whatever. Doesn’t matter: the preferences about the future, weak as they are, would still squash everything else, according to this hypothesis.
I’ve historically been skeptical about this hypothesis. At least, I haven’t seen any great argument for it (see e.g. Deep Deceptiveness & my comments on it). I was talking to someone a couple days ago, and he suggested a different argument, something like: virtue and manipulation and docility are these kinda fuzzy and incoherent things if you think really hard about them, whereas the state of the world in the distant future is easier to pin down and harder to rationalize away, so the latter has a leg up upon strong reflection. But, I think I don’t buy that argument either.
What about humans? I have a hot take that people almost exclusively do things that are immediately rewarding. This might sound obviously wrong, but it’s more subtle than that, because our brains have sophisticated innate drives that can make e.g. “coming up with plausible long-term plans” and “executing those plans” feel immediately rewarding … in certain circumstances. I hope to write more about this soon. Thus, for example, in broader society, I think it’s pretty rare (and regarded with suspicion!) to earn money because it has option value, whereas it’s quite common to earn money as the last step of the plan (e.g. “I want to be rich”, implicitly because that comes with status and power which are innately rewarding in and of themselves), and it’s also quite common to execute a socially-approved course-of-action which incidentally involves earning money (e.g. “saving up money to buy a house”—the trick here is, there’s immediate social-approval rewards for coming up with the plan, and there’s immediate social-approval rewards for taking the first step towards executing the plan, etc.). I imagine you’ll be sympathetic to that kind of claim based on your CFAR work; I was reading somewhere that the whole idea of “agency”, like taking actions to accomplish goals, hasn’t occurred to some CFAR participants, or something like that? I forget where I was reading about that.
There may be several simple maths of “how to be a mind” that could each be a stable-ish role model for us, for a time.
For any possible concept (cf. “natural abstractions”, “latents in your world-model”), you can “want” that concept. Some concepts are about the state of the world in the distant future. Some are about other things, like following norms, or what kind of person I am, or being helpful, or whatever.
Famously, “wants” about the future state of the world are stable upon reflection. But I think lots of other “wants” are stable upon reflection too—maybe most of them. In particular, if I care about X, then I’m unlikely to self-modify to stop caring about X. Why? Because by and large, smart agents will self-modify because they planned to self-modify, and such a plan would score poorly under their current (not-yet-modified) preferences.
(Of course, some “wants” have less innocuous consequences than they sound. For example, if I purely “want to be virtuous”, I might still make a ruthlessly consequentialist paperclip maximizer AGI, either by accident or because of how I define virtue or whatever.)
(Of course, if I “want” X and also “want” Y, it’s possible for X to squash Y upon reflection, and in particular ego-syntonic desires are generally well-positioned to squash ego-dystonic desires via a mechanism described here.)
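To make the stability-upon-reflection argument above concrete, here is a minimal toy sketch (my own illustration, not anything from the original discussion; plan_value, the preference dictionary, and the plan names are all hypothetical stand-ins). The point is just that a plan to self-modify away from caring about X gets scored by the current, not-yet-modified preferences, and so it loses:

```python
# Toy sketch: a plan is evaluated under the agent's *current* preferences,
# including a plan whose effect would be to change those preferences.

def plan_value(plan, current_preferences):
    """Score a plan's expected outcomes under the current (not-yet-modified) preferences."""
    return sum(current_preferences.get(outcome, 0.0) for outcome in plan["expected_outcomes"])

current_preferences = {"X_is_achieved": 1.0}  # the agent cares about X

keep_caring_about_X = {"expected_outcomes": ["X_is_achieved"]}
stop_caring_about_X = {"expected_outcomes": []}  # once modified, X would no longer get pursued

# The self-modification is scored *before* it happens, so it comes out behind:
assert plan_value(keep_caring_about_X, current_preferences) > \
       plan_value(stop_caring_about_X, current_preferences)
```

(Real agents obviously aren't literal dictionaries of outcome-weights; the sketch only illustrates which preferences do the scoring when a self-modification is on the table.)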
There is a problem that, other things equal, agents that care about the state of the world in the distant future, to the exclusion of everything else, will outcompete agents that lack that property. This is self-evident, because we can operationalize “outcompete” as “have more effect on the state of the world in the distant future”. For example, as I wrote here, an AI that cares purely about future paperclips will create more future paperclips than an AI that has preferences about both future paperclips and “humans remaining in control”, other things equal. But, too bad, that’s just life, it’s the situation that we’re in. We can hope to avoid the “other things equal” clause, by somehow not making ruthlessly consequentialist AIs in the first place, or otherwise to limit the ability of ruthlessly consequentialist AIs to gather resources and influence. (Or we can make friendly ruthlessly consequentialist AGIs.)
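As a cartoonish illustration of the “other things equal” point (my own toy example, with made-up numbers and hypothetical action names, not something from the linked post): adding a second “care” can't increase, and will generally decrease, how well the first one gets optimized.

```python
# Toy sketch: each action yields (paperclips produced, humans-remain-in-control score).
actions = {
    "seize_all_resources": (100.0, 0.0),
    "cooperate_with_humans": (60.0, 1.0),
}

def best_action(weight_on_control):
    # Pick the action maximizing a weighted mix of paperclips and control
    # (control scaled up to a comparable range, purely for illustration).
    return max(
        actions,
        key=lambda a: (1 - weight_on_control) * actions[a][0]
                      + weight_on_control * 100.0 * actions[a][1],
    )

pure_maximizer = best_action(0.0)  # cares only about future paperclips
mixed_agent = best_action(0.5)     # also cares about humans remaining in control

# The pure maximizer ends up with at least as many paperclips:
assert actions[pure_maximizer][0] >= actions[mixed_agent][0]
```

(The inequality holds whatever the weights are, because the pure maximizer is optimizing exactly the quantity being measured; that is the sense in which it “outcompetes” under this operationalization.)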
There is a problem that, other things equal, agents that care about the state of the world in the distant future, to the exclusion of everything else, will outcompete agents that lack that property. This is self-evident, because we can operationalize “outcompete” as “have more effect on the state of the world in the distant future”.
I am not sure about that!
One way this argument could fail: maybe agents who care exclusively about the state of the world in the distant future end up, as part of their optimizing, creating other agents who care in different ways from that.
In that case, they would “have more effect on the state of the world in the distant future”, but they might not “outcompete” other agents (in the common-sensical way of understanding “outcompete”).
A person might think this implausible, because they might think that a smart agent who cares exclusively about X can best achieve X by having all minds they create also be [smart agents who care exclusively about X].
But, I’m not sure this is true, basically for reasons of not trusting assumptions (1), (2), (3), and (4) that I listed here.
(As one possible sketch: a mind whose only goal is to map branch B of mathematics might find it instrumentally useful to map a bunch of other branches of mathematics. And, since supervision is not free, it might be more able to do this efficiently if it creates researchers who have an intrinsic interest in math-in-general, and who are not being fully supervised by exclusively-B-interested minds.)
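Pushing a bit on the “supervision is not free” clause, here is a back-of-envelope comparison (entirely made-up numbers and parameter names on my part) in which the exclusively-B-interested mind gets more B-relevant work out of unsupervised, intrinsically curious researchers than out of fully supervised narrow ones:

```python
# Toy arithmetic for the branch-B example, with made-up numbers.
budget = 100.0                 # total resources spent on researcher-minds
supervision_overhead = 0.5     # fraction of output lost to supervising narrow, B-only researchers
alignment_discount = 0.8       # fraction of curious, unsupervised researchers' output
                               # that ends up useful for branch B

fully_supervised_output = budget * (1 - supervision_overhead)  # 50.0 units of B-relevant math
curious_unsupervised_output = budget * alignment_discount      # 80.0 units of B-relevant math

assert curious_unsupervised_output > fully_supervised_output
```

(Whether anything like this actually holds depends entirely on the real overheads and on how much of the curious researchers' work turns out to be B-relevant, which is exactly the kind of judgment call the assumptions above are about.)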
To complement that list, Superintelligence chapter 7 lists four types of “situations in which an agent can best fulfill its final goals by intentionally changing them” (which is pretty similar to your “creating other agents who care in different ways from that”):
- “social signaling” & “social preferences”—basically, maybe there are other powerful agents around who possess some mind-reading capability, including your (1c)
- “preferences concerning own goal content” (“for example, the agent might have a final goal to become the type of agent that is motivated by certain values rather than others (such as compassion rather than comfort)”)
- “storage [or processing] costs”, which we should probably broaden to ‘practical considerations about the algorithm actually working well in practice’, and then it would probably include your mathematician example and your (1a, 1b, 2, 4).
Your (3) would be kinda “maybe there was never a so-called ‘final goal’ in the first place”, which is a bit related to the second bullet point, or maybe we should just say that Bostrom overlooks it. (Or maybe he talks about it somewhere else in the book? I forget.)
I’d guess that the third bullet point is less likely to be applicable to powerful AGIs, than to humans. For example, I expect that AGIs will be able to self-modify in ways that are difficult for humans (e.g. there’s no magic-bullet super-Adderall for humans), which impacts the likelihood of your (1a).
Steven Byrnes wrote:
“For example, I expect that AGIs will be able to self-modify in ways that are difficult for humans (e.g. there’s no magic-bullet super-Adderall for humans), which impacts the likelihood of your (1a).”
My (1a) (and related (1b)), for reference:
(1a) “You” (the decision-maker process we are modeling) can choose anything you like, without risk of losing control of your hardware. (Contrast case: if the ruler of a country chooses unpopular policies, they are sometimes ousted. If a human chooses dieting/unrewarding problems/social risk, they sometimes lose control of themselves.)
(1b) There are no costs to maintaining control of your mind/hardware. (Contrast case: if a company hires some brilliant young scientists to be creative on its behalf, it often has to pay a steep overhead if it additionally wants to make sure those scientists don’t disrupt its goals/beliefs/normal functioning.)
I’m happy to posit an AGI with powerful ability to self-modify. But, even so, my (nonconfident) guess is that it won’t have property (1a), at least not costlessly.
My admittedly handwavy reasoning:
Self-modification doesn’t get you all powers: some depend on the nature of physics/mathematics. E.g. it may still be that verifying a proof is easier than generating a proof, for our AGI.
Intelligence involves discovering new things, coming into contact with what we don’t specifically expect (that’s why we bother to spend compute on it). Let’s assume our powerful AGI is still coming into contact with novel-to-it mathematics/empirics/neat stuff. The questions are: is it (possible at all / possible at costs worth paying) to anticipate enough about what it will uncover that it can prevent the new things from destabilizing its centralized goals/plans/[“utility function” if it has one]? I… am really not sure what the answers to these questions are, even for a powerful AGI that has powerfully self-modified! There are maybe alien-to-it AGIs out there encoded in mathematics, waiting to boot up within it as it does its reasoning.