Planned summary for the Alignment Newsletter:

<@Goal-directedness@>(@Intuitions about goal-directed behavior@) is one of the key drivers of AI risk: it’s the underlying factor that leads to convergent instrumental subgoals. However, it has eluded a good definition so far: we cannot simply say that it is the optimal policy for some simple reward function, as that would imply AlphaGo is not goal-directed (since it was beaten by AlphaZero), which seems wrong. Basically, we should not require _competence_ in order to call a system goal-directed, and so instead of only considering optimal policies, we can consider any policy that could have been output by an RL algorithm, perhaps with limited resources. Formally, we can construct a set of policies for a goal G that can result from running e.g. SARSA with varying amounts of resources with G as the reward, and define the focus of a system towards G as the distance of the system’s policy to the constructed set of policies.
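To make the construction concrete, here is a toy sketch (my own illustration, not from the post: the chain environment, the training budgets, and the use of Hamming distance as the policy metric are all assumptions) that builds a set of SARSA policies under varying budgets and scores a candidate policy by its distance to that set:

```python
# Toy sketch of the "focus" construction: generate tabular SARSA policies
# with varying training budgets, then measure a candidate policy's minimum
# Hamming distance to that set. Environment and metric are made up.
import random

N_STATES, ACTIONS = 5, (0, 1)  # chain MDP: action 1 moves right, action 0 moves left

def step(s, a):
    """Deterministic transition; reward 1 for arriving at the rightmost state."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def sarsa_policy(episodes, alpha=0.1, gamma=0.9, eps=0.5, seed=0):
    """Run tabular SARSA for a given budget; return the greedy policy."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    def pick(s):
        if rng.random() < eps:
            return rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(s, a)])
    for _ in range(episodes):
        s = 0
        a = pick(s)
        for _ in range(20):  # fixed episode length
            s2, r = step(s, a)
            a2 = pick(s2)
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return tuple(max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES))

# "Varying amounts of resources" modeled as varying training budgets (and seeds).
policy_set = {sarsa_policy(n, seed=i) for n in (1, 10, 100, 2000) for i in range(3)}

def focus_distance(policy):
    """Distance of a policy to the constructed set: minimum Hamming distance."""
    return min(sum(p != q for p, q in zip(policy, ref)) for ref in policy_set)

always_right = (1,) * N_STATES  # the competent policy for this goal
always_left = (0,) * N_STATES
print(focus_distance(always_right), focus_distance(always_left))
```

Note that because undertrained runs land in the set too, even fairly incompetent policies can have distance zero, which matches the intent of not requiring competence.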
Planned opinion:
I certainly agree that we should not require full competence in order to call a system goal-directed. I am less convinced of the particular construction here: current RL policies are typically terrible at generalization, and tabular SARSA explicitly doesn’t even _try_ to generalize, whereas I see generalization as a key feature of goal-directedness.
You could imagine the RL policies get more resources and so are able to understand the whole environment without generalization, e.g. if they get to update on every state at least once. However, in this case realistic goal-directed policies would be penalized for “not knowing what they should have known”. For example, suppose I want to eat sweet things, and I come across a new fruit I’ve never seen before. So I try the fruit, and it turns out it is very bitter. This would count as “not being goal-directed”, since the RL policies for “eat sweet things” would already know that the fruit is bitter and so wouldn’t eat it.
Thanks for the summary and opinion!

On the summary, I would say that the following sentence
> Basically, we should not require _competence_ in order to call a system goal-directed, and so instead of only considering optimal policies, we can consider any policy that could have been output by an RL algorithm, perhaps with limited resources.
is written as if goal-directedness is a binary condition, just a more lenient one. I think your next sentence clarifies this a bit, but it might be worth it to mention at this point that the notion of goal-directedness considered here is more like a spectrum/scale.
For your opinion, I agree that this specific formalization is not meant to capture all of goal-directedness, just one aspect that I find important. (It’s also a partial answer to your question about the difference between “being good” and “trying hard”.)
That being said, one point I disagree with in your opinion is about generalization. I’m relatively sure that focus doesn’t capture all of the generalization part of goal-directedness; but if we include model-based RL in the process, we might have some generalization. The trick is that the set of policies considered is the one generated by all possible RL methods on the reward. This is arguably very vague and hard to construct, which is why I mentioned the hope of reducing it to the policies generated by one or a few RL methods.
In light of this, I interpret your comment as pointing out that limiting ourselves to SARSA makes us lose a lot, and thus is not a good idea. By the way, do you have a reference on that? That would be very useful, thanks.
Lastly, I find your example about my resource condition spot on. Even as I wrote it, I didn’t notice that requiring every state to be updated means that the policy “has seen it all” in some sense. This indeed limits the usefulness of focus. That being said, your “eat sweet things” behavior might still have very good focus towards this goal, if your “wrong” exploratory behavior happens rarely enough.
When you encounter a particular state, you only update the Q-value of that state in the table, and don’t do anything to the Q-values of any other state. Therefore, seeing one state will make no difference to your policy on any other state, i.e. no generalization.
You need to use function approximators of some sort to see generalization to new states. (This doesn’t have to be a neural net—you could approximate the Q-function as a linear function over handcoded features, and this would also give you some generalization to new states.)
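This contrast can be shown numerically. In this toy example of mine (the “sweetness” feature and all numbers are made up), one update on an apple leaves a tabular estimate for a new fruit untouched, while the same update through a linear approximator over a handcoded feature shifts the estimate for the never-seen fruit too:

```python
# Toy contrast: a tabular update touches only the visited state's entry,
# while the same update on a linear function approximator also moves the
# estimate for states it has never encountered.
sweetness = {"apple": 0.9, "candy": 1.0, "new_fruit": 0.8}  # handcoded feature

alpha, target = 0.5, 1.0  # one TD-style update after eating an apple (reward 1)

# Tabular: one entry per state; the update touches only "apple".
Q_tab = {s: 0.0 for s in sweetness}
Q_tab["apple"] += alpha * (target - Q_tab["apple"])
print(Q_tab["new_fruit"])  # still 0.0: nothing generalized

# Linear approximator: Q(s) = w * sweetness(s), one weight shared across states.
w = 0.0
x = sweetness["apple"]
w += alpha * (target - w * x) * x  # same update, applied to the shared weight
print(w * sweetness["new_fruit"])  # positive: the update generalized
```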
> The trick is that the set of policies considered is the one generated by all possible RL methods on the reward. This is arguably very vague and hard to construct, which is why I mentioned the hope of reducing it to the policies generated by one or a few RL methods.
Yeah, I was ignoring the “all possible RL methods” because it was vague (and I also expect it not to work for any specific formalization, e.g. you’d have to rule out RL methods that say “if the goal is G, then output <specific policy>, otherwise do regular RL”, which seems non-trivial). If you use only a few RL methods, then I think I would stick with my claim about generalization:
> current RL policies are typically terrible at generalization
I could add the sentence “Alternatively, if you try to use “all possible” RL algorithms, I expect that there will be many pathological RL algorithms that effectively make any policy goal-directed”, if you wanted me to, but I think the version with a small set of RL algorithms seems better to me and I’d rather keep the focus on that.
I updated the summary to:
<@Goal-directedness@>(@Intuitions about goal-directed behavior@) is one of the key drivers of AI risk: it’s the underlying factor that leads to convergent instrumental subgoals. However, it has eluded a good definition so far: we cannot simply say that it is the optimal policy for some simple reward function, as that would imply AlphaGo is not goal-directed (since it was beaten by AlphaZero), which seems wrong. Basically, goal-directedness should not be tied directly to _competence_. So, instead of only considering optimal policies, we can consider any policy that could have been output by an RL algorithm, perhaps with limited resources. Formally, we can construct a set of policies for a goal G that can result from running e.g. SARSA with varying amounts of resources with G as the reward, and define the focus of a system towards G as the distance of the system’s policy to the constructed set of policies.
> When you encounter a particular state, you only update the Q-value of that state in the table, and don’t do anything to the Q-values of any other state. Therefore, seeing one state will make no difference to your policy on any other state, i.e. no generalization.
>
> You need to use function approximators of some sort to see generalization to new states. (This doesn’t have to be a neural net—you could approximate the Q-function as a linear function over handcoded features, and this would also give you some generalization to new states.)
Okay, thanks for the explanation!
> Yeah, I was ignoring the “all possible RL methods” because it was vague (and I also expect it not to work for any specific formalization, e.g. you’d have to rule out RL methods that say “if the goal is G, then output <specific policy>, otherwise do regular RL”, which seems non-trivial). If you use only a few RL methods, then I think I would stick with my claim about generalization:
Yes, trying to ensure that not all policies are generated is indeed the main issue here. It also underlies the resource condition. This makes me think that maybe using RL is not the appropriate way. That being said, I still think an approach exists for computing focus instead of competence. I just don’t know it yet.
> I could add the sentence “Alternatively, if you try to use “all possible” RL algorithms, I expect that there will be many pathological RL algorithms that effectively make any policy goal-directed”, if you wanted me to, but I think the version with a small set of RL algorithms seems better to me and I’d rather keep the focus on that.
I agree that keeping the focus (!) on the more realistic case makes more sense here.