Thanks for your answer!

> Credit assignment (AKA policy gradient) credits the diamond-recognizing circuit as responsible for reward, thereby retaining this diamond abstraction in the weights of the network.
This is different from how I imagine the situation. In my mind, the diamond circuit remains simply because it is a good abstraction for making predictions about the world; its existence is, in my imagination, not tied to the RL update process.
Other than that, I think the rest of your comment doesn’t quite answer my concern, so let me try to formalize it a bit more. Let’s work in the simple setting where the policy network has no world model and is simply a non-recurrent function $f: O \to \Delta(A)$ mapping observations to probability distributions over actions. I imagine a simple version of shard theory claiming that $f$ decomposes as follows:
$$f(o) = \mathrm{SM}\!\left(\sum_i a_i(o) \cdot f_i(o)\right),$$
where $i$ indexes the shards, $a_i(o)$ is the contextual activation strength of the $i$-th shard (maybe with $0 \le a_i(o) \le 1$), and $f_i(o)$ is the action-bid of the $i$-th shard, i.e., the vector of log-probabilities it would like to see for the different actions. $\mathrm{SM}$ is the softmax function, producing the final probabilities.
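For concreteness, here is a minimal numerical sketch of this decomposition (numpy; the two-shard setup, the particular activation values, and the bid vectors are all made-up illustrative assumptions, not anything from the actual proposal):

```python
import numpy as np

def softmax(x):
    # SM: turn the summed action bids into a probability distribution
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy setup: 3 possible actions, 2 shards (index 0 = the "diamond shard").
def a(i, o):
    # a_i(o): contextual activation strength of shard i (hypothetical values;
    # shard 0 is strongly active for this observation)
    return [0.95, 0.30][i]

def f_shard(i, o):
    # f_i(o): the shard's action bid, a vector of log-probabilities / logits
    bids = [np.array([2.0, 0.1, -1.0]),   # diamond shard's preferences
            np.array([-0.5, 1.5, 0.0])]   # some other shard's preferences
    return bids[i]

def policy(o, n_shards=2):
    # f(o) = SM( sum_i a_i(o) * f_i(o) )
    total_bid = sum(a(i, o) * f_shard(i, o) for i in range(n_shards))
    return softmax(total_bid)

print(policy(o="some observation"))  # a distribution over the 3 actions
```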
In your story, the diamond shard starts out very strong. Let’s say it is indexed by $0$, that $a_0(o) \approx 1$ for most inputs $o$, and that $f_0$ has a large “capacity” at its disposal, so that it could in principle represent behaviors for many different tasks.
Now, if a new task pops up in a specific context $o_m$, like solving a maze, I imagine two things could happen to make this possible:

1. $f_0(o_m)$ could get updated to also represent this new behavior.
2. The strength $a_0(o)$ could get downweighted, and some other shard could learn to represent this new behavior.
One reason why the latter may happen is that $f_0$ possibly becomes so complicated that it’s “hard to attach more behavior to it”; maybe it’s just simpler to create an entirely new module that solves this task and doesn’t care about diamonds. If something like this happens often enough, then eventually the diamond shard may lose all its influence.
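As a toy illustration of these two routes (a hand-constructed numpy sketch; the bid vectors and activation values are made up, and nothing here is a claim about what gradient descent actually does):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# In the maze context o_m, suppose the "correct" behavior is to put most
# probability on action 1. Initially only the diamond shard (index 0) matters.
target_bid = np.array([0.0, 3.0, 0.0])
f0_old = np.array([2.0, 0.1, -1.0])      # diamond shard's old bid at o_m

# Route 1: f_0(o_m) itself gets updated to also represent the maze behavior.
route1 = softmax(1.0 * target_bid)

# Route 2: a_0(o_m) gets driven toward 0 and a fresh maze shard (index 1)
# learns the behavior instead; f_0 is left untouched.
a0, a1 = 0.0, 1.0
route2 = softmax(a0 * f0_old + a1 * target_bid)

# Both routes yield the same policy in the maze context ...
assert np.allclose(route1, route2)
# ... but only route 2 erodes the diamond shard's influence (a_0 -> 0).
# Which route the training process actually takes is the quantitative question.
```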
I don’t currently share your intuitions for this particular technical phenomenon being plausible, but I imagine there are other possible reasons this could happen, so, sure? I agree that there are some ways the diamond shard could lose influence. But mostly, again, I expect this to be a quantitative question, and I think experience with people suggests that trying a fun new activity won’t wipe away your other important values.