Intuitively, any agent that searches over actions and chooses the one that best achieves some measure of “goodness” is goal-directed (though there are exceptions, such as the agent that selects actions that begin with the letter “A”). (ETA: I also think that agents that show goal-directed behavior because they are looking at some other agent are not goal-directed themselves—see this comment.)
This post did not convince me that the bolded part is not (almost) a solution.
I would honestly be very surprised if [whatever definition of goal-directedness we end up with] turned out to be such that approval-directed agents are not goal-directed. In what sense is “do what this other agent likes” not a goal? It also seems like you could define lots of things that are kind of like approval-directedness (do what agent a likes, do what is good for a, do what a will reward me for). Are some of those goal-directed and others not?
Excluding approval-directedness feels like patching the natural formulation of goal-directedness: ‘It can be any kind of goal except doing what someone likes’, and even then there seem to be many goals similar to ‘do what someone likes’ that would have to be patched out as well. What’s wrong with saying approval-directedness is a special kind of goal-directedness?
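To make this concrete, here is a minimal sketch (purely illustrative; the function names and the `overseer.approval` interface are made up) of why ‘do what this other agent likes’ fits the same search-over-actions template as any other goal:

```python
def goal_directed_agent(actions, goodness):
    # search over actions and pick the one that scores best on some measure of "goodness"
    return max(actions, key=goodness)

def approval_directed_agent(actions, overseer):
    # "do what this other agent likes" is just a particular goodness measure,
    # so this is the same search with goodness = the overseer's approval
    return goal_directed_agent(actions, goodness=overseer.approval)
```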
I agree that the agent that selects actions that begin with ‘A’ is a counter-example, but this feels like it belongs to a specific class of counter-examples where it’s okay (or even necessary) to patch them. As in, you have these real things (possible actions), you have your representation of them, and now you define something in terms of the representation. (This feels similar to the agent that runs a search algorithm for a random objective, stops as soon as one of its registers is set to 56789ABC and then outputs the policy it’s currently considering.) A definition just has to say that the ‘goodness’ criterion is not allowed to depend on the representation (names of policies/parts of the code/etc.).
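As a rough illustration of that restriction (the names here are hypothetical), compare a ‘goodness’ criterion that looks at how an action is labelled with one that looks only at what the action brings about:

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str        # the representation: how the action is labelled in the code
    outcome: float   # stand-in for what the action actually does in the world

def representation_dependent_goodness(action: Action) -> bool:
    # disallowed: this criterion only depends on the action's name (its representation)
    return action.name.startswith("A")

def representation_independent_goodness(action: Action) -> float:
    # allowed: this criterion only depends on what the action brings about
    return action.outcome
```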
Is there a counter-example (something that runs a search but isn’t goal-directed) that isn’t about the representation?
(This discussion is extremely imprecise and vague; see the upcoming MIRI paper on “The Inner Alignment Problem” for a more thorough discussion.)
Yeah I think I broadly agree with this perspective (in contrast to how I used to think about it), but I still think there is something meaningfully distinct about an AI agent that pursues goals that other agents (e.g. humans) have. I would no longer call it “not goal-directed”, but it still seems like an important distinction.
I assume an updated version of this would link to the Risks from Learned Optimization paper.
Yes, updated, thanks.