You’re mistaken about the view I’m arguing against. (Though perhaps in practice most people think I’m arguing against the view you point out, in which case I hope this post helps them realize their error.) In particular:
Whatever things you care about, you are best off assigning consistent numerical values to them and maximizing the expected sum of those values
If you start by assuming that the agent cares about things, and your prior is that the things it cares about are “simple” (e.g. it is very unlikely to be optimizing the-utility-function-that-makes-the-twitching-robot-optimal), then I think the argument goes through fine. According to me, this means you have assumed goal-directedness in from the start, and are now seeing what the implications of goal-directedness are.
My claim is that if you don’t assume that the agent cares about things, coherence arguments don’t let you say “actually, principles of rationality tell me that since this agent is superintelligent it must care about things”.
Stated this way it sounds almost obvious that the argument doesn’t work, but I used to hear things that effectively meant this pretty frequently in the past. Those arguments usually go something like this:
1. By hypothesis, we will have superintelligent agents.
2. A superintelligent agent will follow principles of rationality, and thus will satisfy the VNM axioms.
3. Therefore it can be modeled as an EU maximizer.
4. Therefore it pursues convergent instrumental subgoals and kills us all.
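For reference, the step from 2 to 3 above is the von Neumann–Morgenstern representation theorem. The standard statement (nothing beyond the textbook result): if a preference relation $\succeq$ over lotteries satisfies completeness, transitivity, continuity, and independence, then there is a utility function $U$, unique up to positive affine transformation, such that

$$A \succeq B \iff \mathbb{E}_{o \sim A}[U(o)] \ge \mathbb{E}_{o \sim B}[U(o)].$$

The dispute here is about whether anything forces an arbitrary superintelligent system to have such a preference relation over outcomes in the first place.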
This talk for example gives the impression that this sort of argument works. (If you look carefully, you can see that it does state that the AI is programmed to have “objects of concern”, which is where the goal-directedness assumption comes in, but you can see why people might not notice that as an assumption.)
----
You might think “well, obviously the superintelligent AI system is going to care about things, maybe it’s technically an assumption but surely that’s a fine assumption”. I think on balance I agree, but it doesn’t seem nearly so obvious to me, and seems to depend on how exactly the agent is built. For example, it’s plausible to me that superintelligent expert systems would not be accurately described as “caring about things”, and I don’t think it was a priori obvious that expert systems wouldn’t lead to AGI. Similarly, it seems at best questionable whether GPT-3 can be accurately described as “caring about things”.
----
As to whether this argument is relevant for whether we will build goal-directed systems: I don’t think that in isolation my argument should strongly change your view on the probability you assign to that claim. I see it more as a constraint on what arguments you can supply in support of that view. If you really were just saying “VNM theorem, therefore 99%”, then probably you should become less confident, but I expect in practice people were not doing that and so it’s not obvious how exactly their probabilities should change.
----
I’d appreciate advice on how to change the post to make this clearer—I feel like your response is quite common, and I haven’t yet figured out how to reliably convey the thing I actually mean.
Thanks. Let me check if I understand you correctly:
You think I take the original argument to be arguing from ‘has goals’ to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.
What you disagree with is an argument from ‘anything smart’ to ‘has goals’, which seems to be what is needed for the AI risk argument to apply to any superintelligent agent.
Is that right?
If so, I think it’s helpful to distinguish between ‘weakly has goals’ and ‘strongly has goals’:
Weakly has goals: ‘has some sort of drive toward something, at least sometimes’ (e.g. aspects of outcomes are taken into account in decisions in some way)
Strongly has goals: ‘pursues outcomes consistently and effectively’ (i.e. decisions maximize expected utility)
So that the full argument I currently take you to be responding to is closer to:
1. By hypothesis, we will have superintelligent machines
2. They will weakly have goals (for various reasons, e.g. they will do something, and maybe that means ‘weakly having goals’ in the relevant way? Probably other arguments go in here.)
3. Anything that weakly has goals has reason to reform to become an EU maximizer, i.e. to strongly have goals (see the sketch just after this list)
4. Therefore we will have superintelligent machines that strongly have goals
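For concreteness, here is the standard money-pump story behind step 3, as a toy simulation. The items, preference cycle, and fee are invented for illustration; this is a sketch of the generic argument, not anything specific from this exchange.

```python
# Toy money-pump: an agent with cyclic (incoherent) preferences will pay a
# small fee for each "upgrade", so a trader can cycle it A -> B -> C -> A
# and drain its money. All names and numbers here are illustrative.

# Cyclic preferences: strictly prefers B to A, C to B, and A to C.
PREFERS = {("B", "A"), ("C", "B"), ("A", "C")}
FEE = 1  # what the agent will pay to trade up to something it prefers


def accepts_trade(current: str, offered: str) -> bool:
    """The agent trades (and pays the fee) iff it prefers the offered item."""
    return (offered, current) in PREFERS


def run_money_pump(start_item: str = "A", start_money: int = 10) -> int:
    """Offer trades around the cycle until the money runs out; return what's left."""
    item, money = start_item, start_money
    offers = {"A": "B", "B": "C", "C": "A"}  # always offer the item it prefers next
    while money >= FEE:
        offered = offers[item]
        if not accepts_trade(item, offered):
            break  # an agent with consistent preferences stops somewhere
        item, money = offered, money - FEE
    return money


if __name__ == "__main__":
    print(run_money_pump())  # 0 -- the incoherent agent gets pumped dry
```

Step 3 appeals to this kind of pressure: an agent whose choices are exploitable in this way has, by this argument, reason to reform toward consistent, EU-maximizing preferences.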
In that case, my current understanding is that you are disagreeing with 2, and that you agree that if 2 holds in some case, then the argument goes through. That is, creatures that are weakly goal directed are liable to become strongly goal directed. (e.g. an agent that twitches because it has various flickering and potentially conflicting urges toward different outcomes is liable to become an agent that more systematically seeks to bring about some such outcomes) Does that sound right?
If so, I think we agree. (In my intuition I characterize the situation as ‘there is roughly a gradient of goal directedness, and a force pulling less goal directed things into being more goal directed. This force probably doesn’t exist out at the zero goal directedness edges, but it is unclear how strong it is in the rest of the space—i.e. whether it becomes substantial as soon as you move out from zero goal directedness, or is weak until you are in a few specific places right next to ‘maximally goal directed’.)
Yes, that’s basically right.
You think I take the original argument to be arguing from ‘has goals’ to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.
Well, I do think it is an interesting/relevant argument (because as you say it explains how you get from “weakly has goals” to “strongly has goals”). I just wanted to correct the misconception about what I was arguing against, and I wanted to highlight the “intelligent” --> “weakly has goals” step as a relatively weak step in our current arguments. (In my original post, my main point was that that step doesn’t follow from pure math / logic.)
In that case, my current understanding is that you are disagreeing with 2, and that you agree that if 2 holds in some case, then the argument goes through.
At least, the argument makes sense. I don’t know how strong its effect is—basically I agree with your phrasing here:
This force probably doesn’t exist out at the zero goal directedness edges, but it is unclear how strong it is in the rest of the space—i.e. whether it becomes substantial as soon as you move out from zero goal directedness, or is weak until you are in a few specific places right next to ‘maximally goal directed’.
I wrote an AI Impacts page summary of the situation as I understand it. If anyone feels like looking, I’m interested in corrections/suggestions (either here or in the AI Impacts feedback box).
Looks good to me :)
There’s a particular kind of cognition that considers a variety of plans, predicts their consequences, rates them according to some (simple / reasonable but not aligned) metric, and executes the one that scores highest.
That sort of cognition will consider plans that disempower humans and then directly improve the metric, as well as plans that don’t disempower humans and directly improve the metric, and in predicting the consequences of these plans and rating them will assign higher ratings to the plans that do disempower humans.
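A minimal sketch of that kind of cognition; `predict_consequences`, `metric`, and the toy plans below are hypothetical placeholders rather than any particular system’s API.

```python
# Sketch of the cognition described above: enumerate plans, predict their
# consequences with a world model, score them on some metric, execute the
# argmax. The helper functions are stand-ins, not a real system.

def choose_plan(candidate_plans, predict_consequences, metric):
    """Return the plan whose predicted consequences score highest on the metric."""
    scored = []
    for plan in candidate_plans:
        outcome = predict_consequences(plan)    # world-model rollout
        scored.append((metric(outcome), plan))  # rate by the (possibly unaligned) metric
    best_score, best_plan = max(scored, key=lambda pair: pair[0])
    return best_plan                            # execute whatever scored highest


if __name__ == "__main__":
    # Toy usage with made-up plans and a trivial metric over dict "outcomes".
    plans = ["ask_politely", "seize_resources"]
    predict = lambda p: {"paperclips": 10 if p == "seize_resources" else 3}
    metric = lambda o: o["paperclips"]
    print(choose_plan(plans, predict, metric))  # -> "seize_resources"
```

The worry in the paragraph above is exactly that nothing in this loop filters out the disempowering plans: if they score higher on the metric, the argmax picks them.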
I kinda want to set aside the term “goal directed” and ask instead to what extent the AI’s cognition does the thing above (or something basically equivalent). When I’ve previously said “intelligence without goal directedness”, I think what I should have been saying is “cognitions that aren’t structured as described above, that nonetheless are useful to us in the real world”.
(For example, an AI that predicts consequences of plans and directly translates those consequences into human-understandable terms, would be very useful while not implementing the dangerous sort of cognition above.)
I certainly agree that even amongst programs that don’t have the structure above there are still ones that are more coherent or less coherent. I’m mostly saying that this seems not very related to whether such a system takes over the world and so I don’t think about it that much.
I think it can make sense to talk about an agent having coherent preferences over its internal state, and whether that’s a useful abstraction depends on why you’re analyzing that agent and more concrete details of the setup.
For instance, say you have an oracle AI that takes compute steps to minimize incoherence in its world models. It doesn’t evaluate or compare steps before taking them, at least as understood in the “agent” paradigm. But it is intelligent in some sense and it somehow takes the correct steps a lot of the time.
If you connect this AI to a secondary memory, it seems reasonable to predict that it will probably start writing to the new memory too, in order to have larger but still coherent models.
I don’t buy this. You could finetune a language model today to do chain-of-thought reasoning, which sounds like “an oracle AI that takes compute steps to minimize incoherence in world models”. I predict that if you then add an additional head that can do read/write to an external memory, or you allow it to output sentences like “Write [X] to memory slot [Y]” or “Read from memory slot [Y]” that we support via an external database, it does not then start using that memory in some particularly useful way.
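For concreteness, a sketch of the kind of hookup described here, where memory commands in the model’s output are serviced by an external store. The command grammar and the dict standing in for the database are illustrative assumptions, not a real interface.

```python
# Illustrative wrapper: scan a language model's output for commands of the
# form "Write [X] to memory slot [Y]" / "Read from memory slot [Y]" and
# service them from a dict standing in for the external database.
import re

memory: dict[str, str] = {}  # stand-in for the external database

WRITE_RE = re.compile(r"Write \[(.+?)\] to memory slot \[(.+?)\]")
READ_RE = re.compile(r"Read from memory slot \[(.+?)\]")


def handle_model_output(text: str) -> str:
    """Execute any memory commands found in the model's output; return read results."""
    results = []
    for value, slot in WRITE_RE.findall(text):
        memory[slot] = value
    for slot in READ_RE.findall(text):
        results.append(memory.get(slot, ""))  # empty string if the slot is unused
    return "\n".join(results)


if __name__ == "__main__":
    handle_model_output("Write [notes about the plan] to memory slot [scratch]")
    print(handle_model_output("Read from memory slot [scratch]"))  # -> notes about the plan
```

The prediction above is that, absent specific training, the model’s outputs simply won’t contain these commands in a useful pattern, however cheap the wrapper is to build.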
You could say that this is because current language models are too dumb, but I don’t particularly see why this will necessarily change in the future (besides that we will probably specifically train the models to use external memory). Overall I’m at “sure maybe a sufficiently intelligent oracle AI would do this but it seems like whether it happens depends on details of how the AI works”.
I get the feeling that a one-dimensional view of directed/purposeful behavior (weak → strong) is constraining the discussion. I do not think we can take as obvious the premise that biologically evolved intelligence is more strongly directed than other organisms, all of which are, in many different ways, strongly directed toward reproduction; it seems there are more dimensions to intelligent purposeful behavior than that. There are some pathological forms of directedness, such as obsessive behavior and drug addiction, but they have not come to dominate in biological intelligence. We generally pursue multiple goals which cannot all be maximized together, and our goals are often satisfied with some mix of less-than-maximal outcomes.
So, even if coherence does imply a force for goal-directed behavior, we have some empirical evidence that it is not necessarily directed towards pathological versions—unless, of course, we have just not been around long enough to see that ultimately, it does!
A few quick thoughts on reasons for confusion:
I think maybe one thing going on is that I already took the coherence arguments to apply only in getting you from weakly having goals to strongly having goals, so since you were arguing against their applicability, I thought you were talking about the step from weaker to stronger goal direction. (I’m not sure what arguments people use to get from 1 to 2 though, so maybe you are right that it is also something to do with coherence, at least implicitly.)
It also seems natural to think of ‘weakly has goals’ as something other than ‘goal directed’, and ‘goal directed’ as referring only to ‘strongly has goals’, so that ‘coherence arguments do not imply goal directed behavior’ (in combination with expecting coherence arguments to be in the weak->strong part of the argument) sounds like ‘coherence arguments do not get you from “weakly has goals” to “strongly has goals”’.
I also think separating out the step from no goal direction to weak, and weak to strong, might be helpful for clarity. It sounded to me like you were considering an argument from ‘any kind of agent’ to ‘strong goal directed’ and finding it lacking, and I was like ‘but any kind of agent includes a mix of those that this force will work on, and those it won’t, so shouldn’t it be a partial/probabilistic move toward goal direction?’ Whereas you were just meaning to talk about what fraction of existing things are weakly goal directed.
Thanks, that’s helpful. I’ll think about how to clarify this in the original post.
Maybe changing the title would prime people less to have the wrong interpretation? E.g., to ‘Coherence arguments require that the system care about something’.
Even just ‘Coherence arguments do not entail goal-directed behavior’ might help, since colloquial “imply” tends to be probabilistic, but you mean math/logic “imply” instead. Or ‘Coherence theorems do not entail goal-directed behavior on their own’.
I think that if a system is designed to do something, anything, it needs at least to care about doing that thing, or some approximation of it.
GPT-3 can be described in a broad sense as caring about following the current prompt (in a way affected by fine-tuning).
I wonder, though, whether there are things you can care about that do not correspond to definite goals whose EU could be maximized. I mean a system for which the optimal path is not to reach some particular point in a subspace of possibilities (sacrificing along the axes the system does not care about), but to maintain some other dynamics while ignoring other axes.
Like how gravity can make you fall into a singularity, or can make you orbit (a simplistic visual analogy).
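A toy numerical contrast for that analogy; the dynamics and constants are invented purely for illustration. One system heads to a fixed point (the “singularity”), the other maintains an orbit and never converges to any point.

```python
# Point-seeking vs orbit-maintaining dynamics, for the gravity analogy above.
# Parameters are arbitrary illustrative choices.
import math


def descend_to_point(x=1.0, y=1.0, lr=0.1, steps=100):
    """Gradient descent on U(x, y) = x^2 + y^2: converges to the point (0, 0)."""
    for _ in range(steps):
        x, y = x - lr * 2 * x, y - lr * 2 * y
    return x, y


def orbit(x=1.0, y=0.0, omega=0.1, steps=100):
    """Rotate by a fixed angle each step: stays on the unit circle forever."""
    for _ in range(steps):
        x, y = (x * math.cos(omega) - y * math.sin(omega),
                x * math.sin(omega) + y * math.cos(omega))
    return x, y


if __name__ == "__main__":
    print(descend_to_point())    # ~ (0, 0): falls all the way in
    print(math.hypot(*orbit()))  # ~ 1.0: still orbiting at the same radius
```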