When you encounter a particular state, you only update the Q-value of that state in the table, and don’t do anything to the Q-values of any other state. Therefore, seeing one state will make no difference to your policy on any other state, i.e. no generalization.
You need to use function approximators of some sort to see generalization to new states. (This doesn’t have to be a neural net—you could approximate the Q-function as a linear function over handcoded features, and this would also give you some generalization to new states.)
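To make the contrast concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the five-state toy problem, the feature vector `[1, s, a, s*a]`, and the step size and discount. It applies one Q-learning update to a table and one semi-gradient update to a linear Q-function, then lists which state-action values moved.

```python
# Hypothetical toy setup: states are integers 0..4, there are two actions,
# and the hand-coded features for a state-action pair are [1, s, a, s*a].
ALPHA, GAMMA = 0.1, 0.9
ACTIONS = (0, 1)

def tabular_update(q, s, a, r, s_next):
    """One Q-learning step on a table: only the entry (s, a) changes."""
    target = r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] += ALPHA * (target - q[(s, a)])

def features(s, a):
    return [1.0, float(s), float(a), float(s * a)]

def linear_q(w, s, a):
    return sum(wi * xi for wi, xi in zip(w, features(s, a)))

def linear_update(w, s, a, r, s_next):
    """One semi-gradient Q-learning step: the shared weights change."""
    target = r + GAMMA * max(linear_q(w, s_next, b) for b in ACTIONS)
    td_error = target - linear_q(w, s, a)
    x = features(s, a)
    for i in range(len(w)):
        w[i] += ALPHA * td_error * x[i]

def demo():
    pairs = [(s, a) for s in range(5) for a in ACTIONS]

    # Tabular case: start from all zeros and apply a single update.
    q = {pair: 0.0 for pair in pairs}
    tabular_update(q, s=2, a=1, r=1.0, s_next=3)
    tabular_changed = [pair for pair in pairs if q[pair] != 0.0]

    # Linear case: same transition, but the update goes into shared weights.
    w = [0.0] * 4
    linear_update(w, s=2, a=1, r=1.0, s_next=3)
    linear_changed = [pair for pair in pairs if linear_q(w, *pair) != 0.0]
    return tabular_changed, linear_changed

tabular_changed, linear_changed = demo()
print(tabular_changed)      # only the visited pair
print(len(linear_changed))  # every pair's estimate moved
```

In the tabular case only `(2, 1)` changes. In the linear case the single update shifts the estimate for every pair, because all of them read from the same weight vector. Whether that shift is a good generalization depends entirely on the features.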
The trick is that the set of policies considered is the one generated by all possible RL methods on the reward. This is arguably very vague and hard to construct, which is why I mentioned the hope of reducing it to the policies generated by one or a few RL methods.
Yeah, I was ignoring the “all possible RL methods” because it was vague (and I also expect it not to work for any specific formalization, e.g. you’d have to rule out RL methods that say “if the goal is G, then output <specific policy>, otherwise do regular RL”, which seems non-trivial). If you use only a few RL methods, then I think I would stick with my claim about generalization:
current RL policies are typically terrible at generalization
I could add the sentence “Alternatively, if you try to use ‘all possible’ RL algorithms, I expect that there will be many pathological RL algorithms that effectively make any policy goal-directed”, if you wanted me to, but I think the version with a small set of RL algorithms seems better to me and I’d rather keep the focus on that.
I updated the summary to:
<@Goal-directedness@>(@Intuitions about goal-directed behavior@) is one of the key drivers of AI risk: it’s the underlying factor that leads to convergent instrumental subgoals. However, it has eluded a good definition so far: we cannot simply say that it is the optimal policy for some simple reward function, as that would imply AlphaGo is not goal-directed (since it was beaten by AlphaZero), which seems wrong. Basically, goal-directedness should not be tied directly to _competence_. So, instead of only considering optimal policies, we can consider any policy that could have been output by an RL algorithm, perhaps with limited resources. Formally, we can construct a set of policies for G that can result from running e.g. SARSA with varying amounts of resources with G as the reward, and define the focus of a system towards G to be the distance of the system’s policy to the constructed set of policies.
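As a rough illustration of the construction in that last sentence, here is a sketch in Python. It is not the post's formalization. The chain environment, the use of episode counts and seeds as the "resources", and the disagreement-fraction distance are all assumptions picked to keep the example small.

```python
import random

N_STATES = 5      # hypothetical chain: states 0..4, goal G = reach state 4
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(s, a):
    """Deterministic chain dynamics; reward 1 on entering the last state."""
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def sarsa_policy(episodes, seed, alpha=0.5, gamma=0.9, eps=0.2, max_steps=50):
    """Run tabular SARSA for `episodes` episodes and return the greedy policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    def choose(s):
        # Epsilon-greedy action selection with random tie-breaking.
        if rng.random() < eps:
            return rng.choice(ACTIONS)
        best = max(q[(s, b)] for b in ACTIONS)
        return rng.choice([b for b in ACTIONS if q[(s, b)] == best])

    for _ in range(episodes):
        s, a = 0, choose(0)
        for _ in range(max_steps):
            s_next, r, done = step(s, a)
            a_next = choose(s_next)
            # On-policy target: uses the action actually selected next.
            target = r + (0.0 if done else gamma * q[(s_next, a_next)])
            q[(s, a)] += alpha * (target - q[(s, a)])
            s, a = s_next, a_next
            if done:
                break
    # Greedy policy as a tuple of actions, one per state (ties go to action 0).
    return tuple(max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES))

def policy_set(budgets, seeds):
    """Policies SARSA can produce for G under the given resource budgets."""
    return {sarsa_policy(n, seed) for n in budgets for seed in seeds}

def distance(p1, p2):
    """Assumed metric: fraction of states where two policies disagree."""
    return sum(a != b for a, b in zip(p1, p2)) / len(p1)

def distance_to_set(policy, reference_set):
    """Distance from a policy to the nearest member of the reference set."""
    return min(distance(policy, ref) for ref in reference_set)

refs = policy_set(budgets=(0, 5, 50, 500), seeds=range(3))
print(distance_to_set((1, 1, 1, 1, 1), refs))  # always-right policy
print(distance_to_set((1, 0, 1, 0, 1), refs))  # alternating policy
```

Under this reading, a smaller distance means the system looks more like something SARSA could have produced for G, regardless of how competent it is. Note that the zero-episode budget puts the untrained all-left policy in the set, which is a small instance of the concern about the set being too permissive.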
When you encounter a particular state, you only update the Q-value of that state in the table, and don’t do anything to the Q-values of any other state. Therefore, seeing one state will make no difference to your policy on any other state, i.e. no generalization.
You need to use function approximators of some sort to see generalization to new states. (This doesn’t have to be a neural net—you could approximate the Q-function as a linear function over handcoded features, and this would also give you some generalization to new states.)
Okay, thanks for the explanation!
Yeah, I was ignoring the “all possible RL methods” because it was vague (and I also expect it not to work for any specific formalization, e.g. you’d have to rule out RL methods that say “if the goal is G, then output <specific policy>, otherwise do regular RL”, which seems non-trivial). If you use only a few RL methods, then I think I would stick with my claim about generalization:
Yes, trying to ensure that not all policies are generated is indeed the main issue here. It also underlies the resource condition. This makes me think that maybe using RL is not the appropriate way. That being said, I still think an approach exists for computing focus instead of competence. I just don’t know it yet.
I could add the sentence “Alternatively, if you try to use ‘all possible’ RL algorithms, I expect that there will be many pathological RL algorithms that effectively make any policy goal-directed”, if you wanted me to, but I think the version with a small set of RL algorithms seems better to me and I’d rather keep the focus on that.
I agree that keeping the focus (!) on the more realistic case makes more sense here.