I think that strictly speaking this post (or at least the main thrust) is true, and proven in the first section. The title is arguably less true: I think of ‘coherence arguments’ as including things like ‘it’s not possible for you to agree to give me a limitless number of dollars in return for nothing’, which does imply some degree of ‘goal-direction’.
I think the post is important, because it constrains the types of valid arguments that can be given for ‘freaking out about goal-directedness’, for lack of a better term. In my mind, it provokes various follow-up questions:
1. What arguments would imply ‘goal-directed’ behaviour?
2. With what probability will a random utility maximiser be ‘goal-directed’?
3. How often should I think of a system as a utility maximiser in resources, perhaps with a slowly-changing utility function?
4. How ‘goal-directed’ are humans likely to make systems, given that we are making them in order to accomplish certain tasks that don’t look like random utility functions?
5. Is there some kind of ‘basin of goal-directedness’ that systems fall in if they’re even a little goal-directed, causing them to behave poorly?
Off the top of my head, I’m not familiar with compelling responses from the ‘freak out about goal-directedness’ camp on points 1 through 5, even though as a member of that camp I think that such responses exist. Responses from outside this camp include Rohin’s post ‘Will humans build goal-directed agents?’. Another response is Brangus’ comment post, although I find its theory of goal-directedness uncompelling.
I think that it’s notable that Brangus’ post was released soon after this was announced as a contender for Best of LW 2018. I think that if this post were added to the Best of LW 2018 Collection, the ‘freak out’ camp might produce more of these responses and move the dialogue forward. As such, I think it should be added, both because of the clear argumentation and because of the response it is likely to provoke.
Putting my cards on the table, these are my guesses at the answers to the questions that I raise:
1. I don’t know.
2. Low.
3. Frequently, if it’s an ‘intelligent’ one.
4. Relatively. You probably don’t end up with systems that resist literally all changes to their goals, but you probably do end up with systems that resist most changes to their goals, barring specific effort to prevent that.
5. Probably.
That being said, I think that a better definition of ‘goal-directedness’ would go a long way in making me less confused by the topic.
I have no idea why I responded ‘low’ to question 2. Does anybody think that’s reasonable and fits in with what I wrote here, or did I just mean ‘high’?
“random utility-maximizer” is pretty ambiguous; if you imagine the space of all possible utility functions over action-observation histories and you imagine a uniform distribution over them (suppose they’re finite, so this is doable), then the answer is low.
Heh, looking at my comment it turns out I said roughly the same thing 3 years ago.
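To make the point above about a uniform distribution over utility functions more concrete, here is a minimal toy sketch of my own (the action set, observation set, and horizon are invented purely for illustration, and it does not attempt to define ‘goal-directedness’): it enumerates every action-observation history for a short horizon, samples a few utility functions uniformly at random over those histories, and prints the highest-utility history under each.

```python
# Toy sketch, not from the post or comments: all specifics below are
# invented for illustration. A utility function drawn uniformly at random
# over action-observation histories picks out an essentially arbitrary
# "best" history each time it is sampled.
import itertools
import random

ACTIONS = ["left", "right"]        # tiny illustrative action set
OBSERVATIONS = ["red", "blue"]     # tiny illustrative observation set
HORIZON = 3                        # length of each history

# Enumerate every possible action-observation history.
steps = list(itertools.product(ACTIONS, OBSERVATIONS))
histories = list(itertools.product(steps, repeat=HORIZON))
print(f"{len(histories)} possible histories")  # (2 * 2) ** 3 = 64

for seed in range(3):
    rng = random.Random(seed)
    # Uniformly random utility: an independent value for every history.
    utility = {h: rng.random() for h in histories}
    best = max(histories, key=utility.get)
    print(f"sample {seed}: highest-utility history = {best}")
```

Even in this 64-history toy case the optimal history changes arbitrarily between draws; with realistic horizons and action/observation sets the space is vastly larger, and a uniformly random utility function carries no obvious structure that one would naturally describe as a goal.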
I pretty strongly agree with this review (and, just to be clear, it was written without any input from me, even though Daniel and I are both at CHAI).
Yeah, maybe I should say “coherence theorems” to be clearer about this? (Like, it isn’t a theorem that I shouldn’t give you a limitless number of dollars in return for nothing; maybe I think that you are more capable than me and fully aligned with me, and so you’d do a better job with my money. Or maybe I value your happiness, and the best way to purchase it is to give you money with no strings attached.)
Re ‘Responses from outside this camp’: Fwiw, I do in fact worry about goal-directedness, but (I think) I know what you mean. (For others, I think Daniel is referring to something like “the MIRI camp”, though that is also not an accurate pointer, and it is true that I am outside that camp.)
My responses to the questions:
1. The ones in ‘Will humans build goal-directed agents?’, but if you want arguments that aren’t about humans, then I don’t know.
2. Depends on the distribution over utility functions, the action space, etc., but e.g. if it uniformly selects a numeric reward value for each possible trajectory (state-action sequence) where the actions are low-level (e.g. human muscle control), astronomically low (see the rough calculation after this list).
3. That will probably be a good model for some (many?) powerful AI systems that humans build.
4. I don’t know. (I think it depends quite strongly on the way in which we train powerful AI systems.)
5. Not likely at low levels of intelligence, plausible at higher levels of intelligence, but really the question is not specified enough.
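To give a sense of the ‘astronomically low’ in answer 2, here is a rough back-of-the-envelope sketch. Every number in it is an invented placeholder (the discretisation of muscle commands, the control frequency, the time horizon), so treat it as an illustration of the orders of magnitude involved rather than a real estimate.

```python
# Back-of-the-envelope count of low-level trajectories.
# Every number here is an invented placeholder, chosen only to show how
# quickly the trajectory space blows up for low-level actions.
import math

actions_per_step = 2 ** 10   # crude discretisation of muscle commands
steps = 100 * 60 * 60        # 100 Hz control for one hour
# The number of distinct action sequences of that length is
# actions_per_step ** steps; work in log10 to avoid a huge integer.
log10_trajectories = steps * math.log10(actions_per_step)
print(f"roughly 10^{log10_trajectories:,.0f} possible trajectories")
# A utility function that assigns an independent random value to each
# trajectory has that many degrees of freedom, so a random draw is
# overwhelmingly unlikely to encode anything like a compact goal.
```

Whatever you make of the particular made-up numbers, the exponent ends up in the millions, which is the sense in which a uniformly random reward assignment over such trajectories is ‘astronomically’ unlikely to look goal-directed.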
Well, I didn’t consult you in the process of writing the review, but we’ve had many conversations on the topic, which presumably influenced how I think about it and what I ended up writing.
Re “coherence theorems”: sorry, I meant theorems taking ‘no limitless dollar sink’ as an axiom and deriving something interesting from that.