I don’t understand: why can’t you just have some neurons which represent the former, and some neurons which represent the latter?
Because people thought you needed the same weights to 1) transport the gradients back and 2) send the activations forward. Having two distinct networks with the same topology and getting their weights to match was known as the “weight transport problem”. See Grossberg, S. 1987. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science 11(1):23–63.
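To make that concrete, here is a rough NumPy sketch of a single linear layer. All the names (`W`, `B_matched`, `B_unmatched`, `delta_out`) are made up for illustration and the sizes are arbitrary. It only shows that the backward pass in backprop reuses the forward weights (as a transpose), so a separate feedback pathway gives the same error signal only if its weights happen to match.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny linear layer: 3 inputs -> 2 outputs.
W = rng.normal(size=(2, 3))      # forward weights
x = rng.normal(size=3)           # input activations
y = W @ x                        # forward pass uses W

# Some error signal arriving at the layer's output (arbitrary values).
delta_out = rng.normal(size=2)

# Backprop sends the error backward through the transpose of the same W.
delta_in_backprop = W.T @ delta_out

# A separate feedback pathway needs its own weights B (shape 3x2).
# It reproduces the backprop signal only if B equals W.T.
B_matched = W.T.copy()
B_unmatched = rng.normal(size=(3, 2))

delta_in_matched = B_matched @ delta_out
delta_in_unmatched = B_unmatched @ delta_out
```

With matched feedback weights the two error signals agree; with independently drawn feedback weights they generally don’t, which is the gap a physically separate backward network would somehow have to close.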
Do you have any particular source for dropout being replaced by batch normalisation, or is it an impression from the papers you’ve been reading?
The latter.