Key hypothesis: neural nets or brains are typically initialized in a “scarce channels” regime. A randomly initialized neural net generally throws out approximately-all information by default (at initialization), as opposed to passing lots of information around to lots of parts of the net.
Just to make sure I’m understanding this correctly, you’re claiming that the mutual information between the input and the output of a randomly initialized network is low, where we have some input distribution and treat the network weights as fixed? (You also seem to make similar claims about things inside the network, but I’ll just focus on input-output mutual information)
I think we can construct toy examples where that’s false. E.g. use a feedforward MLP with any bijective activation function and where input, output, and all hidden layers have the same dimensionality (so the linear transforms are all described by random square matrices). Since a random square matrix will be invertible with probability one, this entire network is invertible at random initialization, so the mutual information between input and output is maximal (the entropy of the input).
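Here is a minimal numerical sketch of that construction (my own illustration, using numpy with tanh as the bijective activation): the network can be inverted layer by layer, so the output determines the input exactly.

```python
# Minimal sketch of the invertible-MLP construction: square random weight
# matrices plus a bijective activation (tanh), inverted layer by layer.
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 4
weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
biases = [0.1 * rng.normal(size=d) for _ in range(n_layers)]

def forward(x):
    for W, b in zip(weights, biases):
        x = np.tanh(W @ x + b)
    return x

def invert(y):
    # Undo each layer in reverse: arctanh inverts tanh, then solve the linear map.
    for W, b in zip(reversed(weights), reversed(biases)):
        y = np.linalg.solve(W, np.arctanh(y) - b)
    return y

x = 0.5 * rng.normal(size=d)
y = forward(x)
print(np.allclose(invert(y), x))  # True: no information about x is lost
```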
These are unrealistic assumptions (though I think the argument should still work as long as the hidden layers aren’t lower-dimensional than the input). In practice the hidden dimensionality will of course often be lower than that of the input, but then it seems to me that the dimensionality bottleneck is the key, not the random initialization. (Mutual information would still be maximal for that architecture, I think.) Maybe using ReLUs instead of bijective activations messes all of this up? It would be really weird, though, if ReLU vs. tanh were the deciding factor in whether network internals mirror the external abstractions.
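To make the ReLU worry concrete, here is a small toy check (my own illustration, not part of the original argument): a single random ReLU layer is already non-injective, unlike the tanh layers above, so it can genuinely discard information; whether a whole random ReLU network discards a lot of it is exactly the open question.

```python
# Toy check that a single random ReLU layer is not injective: perturb the input
# along a direction that only moves a coordinate whose pre-activation is
# negative, and the ReLU output does not change at all.
import numpy as np

rng = np.random.default_rng(1)
d = 8
W = rng.normal(size=(d, d)) / np.sqrt(d)

x = rng.normal(size=d)
z = W @ x
i = int(np.argmin(z))   # pick the most negative pre-activation coordinate
assert z[i] < 0

# Move x by a multiple of the i-th column of W^{-1}, which only shifts z[i].
x2 = x + np.linalg.inv(W)[:, i] * 0.1 * abs(z[i])
print(np.allclose(np.maximum(W @ x, 0), np.maximum(W @ x2, 0)))  # True
print(np.allclose(x, x2))                                        # False
```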
My take on what’s going on here is that at random initialization, the neural network doesn’t pass around information in an easily usable way. I’m just arguing that mutual information doesn’t really capture this and we need some other formalization (maybe along the lines of this: https://arxiv.org/abs/2002.10689 ). I don’t have a strong opinion how much that changes the picture, but I’m at least hesitant to trust arguments based on mutual information if we ultimately want some other information measure we haven’t defined yet.
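To gesture at what “easily usable” might mean, here is a rough sketch in the spirit of the V-information idea from that paper (my own crude proxy, not the paper’s actual definition): instead of asking how much information about the input is present in the activations, ask how much a restricted family of decoders, here just linear probes, can actually extract.

```python
# Crude "usable information" proxy: how well can a *linear* probe reconstruct
# the input from the activations of a deep random tanh net?  The net is
# bijective (same argument as above), so the mutual information is maximal,
# but a restricted decoder need not recover much of it.
import numpy as np

rng = np.random.default_rng(2)
n, d, depth = 2000, 16, 6

# Gain > 1 puts tanh units into their strongly nonlinear (saturating) regime.
Ws = [2.0 * rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(depth)]
X = rng.normal(size=(n, d))
H = X
for W in Ws:
    H = np.tanh(H @ W.T)

# Best linear reconstruction of X from the final activations H, scored by R^2.
coef, *_ = np.linalg.lstsq(H, X, rcond=None)
r2 = 1 - ((X - H @ coef) ** 2).sum() / ((X - X.mean(0)) ** 2).sum()
print(f"linear-probe reconstruction R^2: {r2:.3f}")
```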
My take on what’s going on here is that at random initialization, the neural network doesn’t pass around information in an easily usable way. I’m just arguing that mutual information doesn’t really capture this and we need some other formalization
Yup, I think that’s probably basically correct for neural nets, at least viewing them in the simplest way. I do think there are clever ways of modeling nets which would probably make mutual information a viable modeling choice—in particular, treat the weights as unknown, so we’re talking about mutual information not conditional on the weights. But that approach isn’t obviously computationally tractable, so probably something else would be more useful.
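Here’s a minimal Monte Carlo sketch of that modelling choice (just a toy construction with a coarse sign-pattern readout, so the entropies can be estimated by counting): conditional on a typical fixed weight draw, the readout retains most of the information about the input, whereas with fresh weights for every sample most of that dependence washes out.

```python
# Toy Monte Carlo comparison of the two modelling choices, using a coarse
# sign-pattern readout Y of a one-layer tanh net on 4-bit inputs:
#   I(X;Y | W=w): weights fixed, as in the discussion above;
#   I(X;Y):       weights unknown, i.e. a fresh random W for every sample.
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)
d = 4
xs = np.array(list(np.ndindex(*(2,) * d)), dtype=float)  # X uniform on {0,1}^4

def readout(W, x):
    return tuple((np.tanh(W @ x) > 0).astype(int))

def mi(pairs):
    """Plug-in estimate of mutual information (in bits) from (x, y) samples."""
    n = len(pairs)
    def H(c):
        p = np.array(list(c.values())) / n
        return -(p * np.log2(p)).sum()
    cx, cy, cxy = Counter(), Counter(), Counter()
    for x, y in pairs:
        cx[x] += 1; cy[y] += 1; cxy[(x, y)] += 1
    return H(cx) + H(cy) - H(cxy)

n = 100_000
W_fixed = rng.normal(size=(d, d))
fixed = [(tuple(x.astype(int)), readout(W_fixed, x))
         for x in xs[rng.integers(len(xs), size=n)]]
fresh = [(tuple(x.astype(int)), readout(rng.normal(size=(d, d)), x))
         for x in xs[rng.integers(len(xs), size=n)]]

print(f"I(X;Y | W=w) ≈ {mi(fixed):.2f} bits,   I(X;Y) ≈ {mi(fresh):.2f} bits")
```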