Are you saying that such a mechanism occurs by coincidence, or that it’s actively constructed?
It seems like, at every intermediate step, each consumer of the almost-identical subnetworks would naturally just pick one and use that one, since it was slightly closer to what the consumer needed.
Suppose that each subnetwork does general reasoning, and thus, up until some point during training, the subnetworks are useful for minimizing loss.