However, the link from instrumentally convergent goals to dangerous influence-seeking is only applicable to agents which have final goals large-scale enough to benefit from these instrumental goals, and does not address the “Riemann disaster” or “Paperclip maximizer” examples [1]:
Riemann hypothesis catastrophe. An AI, given the final goal of evaluating the Riemann hypothesis, pursues this goal by transforming the Solar System into “computronium” (physical resources arranged in a way that is optimized for computation)— including the atoms in the bodies of whomever once cared about the answer.
Paperclip AI. An AI, designed to manage production in a factory, is given the final goal of maximizing the manufacture of paperclips, and proceeds by converting first the Earth and then increasingly large chunks of the observable universe into paperclips.
I notice I am surprised you write:

> and does not address the “Riemann disaster” or “Paperclip maximizer” examples [1]

Do you think that the argument motivating these examples is invalid?

Do you disagree with the claim that even systems with very modest and specific goals will have incentives to seek influence to perform their tasks better?
> Do you think that the argument motivating these examples is invalid?
Yes, because it skips over the most important part: what it means to “give an AI a goal”. For example, perhaps we give the AI positive reward every time it solves a maths problem, but it never has a chance to seize more resources during training—all it’s able to do is think about them. Have we “given it” the goal of solving maths problems by any means possible, or the goal of solving maths problems by thinking about them? The former I’d call large-scale, the latter I wouldn’t.
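To make this concrete: here is a minimal toy sketch, with all names and numbers invented for illustration rather than taken from any real training setup. The reward signal only ever reinforces choices among the reasoning actions that exist in the training environment; there is simply nothing resembling a resource-seizing action for it to reinforce.

```python
import random

# Toy illustration only (invented names, not a real training pipeline):
# "giving the AI a goal" here just means a reward signal, and that signal
# is only ever computed over the actions available during training.
REASONING_ACTIONS = ["try_algebra", "try_induction", "try_contradiction"]
# Deliberately absent: anything like "acquire_more_compute", so resource-seeking
# is never reinforced because the agent can never even try it in training.


def solves_problem(action: str, problem: int) -> bool:
    """Stand-in for checking a proposed solution; the details don't matter here."""
    rng = random.Random(hash((action, problem)))
    return rng.random() < 0.3


def train(num_problems: int = 1000, epsilon: float = 0.1) -> dict:
    """Bandit-style loop: reinforce whichever reasoning action earns reward."""
    value = {a: 0.0 for a in REASONING_ACTIONS}   # estimated value per action
    counts = {a: 0 for a in REASONING_ACTIONS}
    for problem in range(num_problems):
        # Epsilon-greedy choice among the actions that exist in training.
        if random.random() < epsilon:
            action = random.choice(REASONING_ACTIONS)
        else:
            action = max(value, key=value.get)
        reward = 1.0 if solves_problem(action, problem) else 0.0
        counts[action] += 1
        # Incremental average: nudge the chosen action's value toward the reward.
        value[action] += (reward - value[action]) / counts[action]
    return value


if __name__ == "__main__":
    print(train())
```

Running it just prints the learned per-action values; the point is only that whatever gets reinforced is, by construction, a policy over these in-training actions.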
I think I’ll concede that “large-scale” is maybe a bad word for the concept I’m trying to point to, because it’s not just a property of the goal, it’s a property of how the agent thinks about the goal too. But the idea I want to put forward is something like: if I have the goal of putting a cup on a table, there’s a lot of implicit context around which table I’m thinking about, which cup I’m thinking about, and what types of actions I’m thinking about. If for some reason I need to solve world peace in order to put the cup on the table, I won’t adopt solving world peace as an instrumental goal, I’ll just shrug and say “never mind then, I’ve hit a crazy edge case”. I don’t think that’s because I have safe values. Rather, this is just how thinking works—concepts are contextual, and it’s clear when the context has dramatically shifted.
So I guess I’m kind of thinking of large-scale goals as goals that have a mental “ignore context” tag attached. And these are certainly possible; some humans have them. But it’s also possible to have exactly the same goal, but only defined within “reasonable” boundaries—and given the techniques we’ll be using to train AGIs, I’m pretty uncertain which one will happen by default. Seems like, when we’re talking about tasks like “manage this specific factory” or “solve this specific maths problem”, the latter is more natural.
Let me try to paraphrase this:

In the first paragraph you are saying that “seeking influence” is not something that a system will learn to do if that was not a possible strategy in the training regime (but couldn’t it appear as an emergent property? Certainly humans were not trained to launch rockets—but they nevertheless did).
In the second paragraph you are saying that common sense sometimes allows you to modify the goals you were given (but for this to apply to AI systems, wouldn’t they need to have common sense in the first place, which kind of assumes that the AI is already aligned?)
In the third paragraph it seems to me that you are saying that humans have some goals that have a built-in override mechanism in them—e.g. in general humans have a goal of eating delicious cake, but they will forego this goal in the interest of seeking water if they are about to die of dehydration (but doesn’t this seem to be a consequence of these goals being just instrumental things that proxy the complex thing that humans actually care about?)
I think I am confused because I do not understand your overall point, so the three paragraphs seem to be saying wildly different things to me.
Hey, thanks for the questions! It’s a very confusing topic so I definitely don’t have a fully coherent picture of it myself. But my best attempt at a coherent overall point:
> In the first paragraph you are saying that “seeking influence” is not something that a system will learn to do if that was not a possible strategy in the training regime.
No, I’m saying that giving an agent a goal, in the context of modern machine learning, involves reinforcement in the training regime. It’s not clear to me exactly what goals will result from this, but we can’t just assume that we can “give an AI the final goal of evaluating the Riemann hypothesis” in a way that’s devoid of all context.
> you are saying that common sense sometimes allows you to modify the goals you were given (but for this to apply to AI systems, wouldn’t they need to have common sense in the first place, which kind of assumes that the AI is already aligned?)
It may be the case that it’s very hard to train AIs without common sense of some kind, potentially because a) that’s just the default for how minds work (they don’t by default extrapolate to crazy edge cases), and b) common sense is very useful in general. For example, if you train AIs on obeying human instructions, then they will only do well in the training environment if they have a common-sense understanding of what humans mean.
> humans have some goals that have a built-in override mechanism in them
No, it’s more that the goal itself is only defined in a small-scale setting, because the agent doesn’t think in ways which naturally extrapolate small-scale goals to large scales.
Perhaps it’s useful to think about having the goal of getting a coffee. And suppose there is some unusual action you can take to increase the chances that you get the coffee by 1%. For example, you could order ten coffees instead of one coffee, to make sure at least one of them arrives. There are at least two reasons you might not take this unusual action. In some cases it goes against your values—for example, if you want to save money. But even if that’s not true, you might just not think about what you’re doing as “ensure that I have coffee with maximum probability”, but rather just “get a coffee”. This goal is not high-stakes enough for you to actually extrapolate beyond the standard context. And then some people are just like that with all their goals—so why couldn’t an AI be too?
I think this helped me understand you a bit better—thank you
Let me try paraphrasing this:
> Humans are our best example of a sort-of-general intelligence. And humans have a lazy, satisficing, ‘small-scale’ kind of reasoning that is mostly only well suited for activities close to their ‘training regime’. Hence AGIs may also be the same—and in particular, if AGIs are trained with Reinforcement Learning and heavily rewarded for following human intentions, this may be a likely outcome.

Is that pointing in the direction you intended?
> Have we “given it” the goal of solving maths problems by any means possible, or the goal of solving maths problems by thinking about them?
The distinction that you’re pointing at is useful. But I would have filed it under “difference in the degree of agency”, not under “difference in goals”. When reading the main text, I thought this to be the reason why you introduced the six criteria of agency.
E.g., System A tries to prove the Riemann hypothesis by thinking about the proof. System B first seizes power and converts the galaxy into a supercomputer, to then prove the Riemann hypothesis. Both systems maybe have the goal of “proving the Riemann hypothesis”, but System B has “more agency”: it certainly has self-awareness, considers more sophisticated and diverse plans of larger scale, and so on.