This was a really interesting paper; however, I was left with one question. Can anyone argue why exactly the model is motivated to learn a much more complex function than the identity map? An auto-encoder whose latent space is much smaller than the input is forced to learn an interesting map; however, I can’t see why a highly over-parameterised auto-encoder wouldn’t simply learn something close to an identity map. Is it somehow the regularisation or the bias terms? I’d love to hear an argument for why the auto-encoder is likely to learn these mono-semantic features as opposed to an identity map.
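It’s a sparse autoencoder: the loss function includes an L1 penalty on the hidden-layer activations, which encourages most of them to be zero for any given input. Without that penalty, an over-parameterised auto-encoder could indeed just learn something close to an identity map!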
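To make that concrete, here is a minimal NumPy sketch of the kind of loss being described. It is not the paper's code: the function name `sae_loss`, the ReLU hidden layer, the coefficient `l1_coeff`, and the hand-built weights are all assumptions chosen for illustration. It builds an "identity-like" solution in an over-complete hidden layer and shows that it reconstructs perfectly but still pays the L1 cost.

```python
import numpy as np

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Reconstruction error plus an L1 penalty on the hidden activations."""
    h = np.maximum(0.0, x @ W_enc + b_enc)            # ReLU hidden layer (assumed)
    x_hat = h @ W_dec + b_dec                         # linear decoder
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(h), axis=1))     # L1 norm of activations
    return recon + l1_coeff * sparsity, recon, sparsity

rng = np.random.default_rng(0)
d, m = 4, 16                      # input width, over-complete hidden width
x = rng.normal(size=(8, d))

# Hand-built identity-like weights: one block of hidden units carries relu(x),
# another carries relu(-x), so the decoder can recover x exactly.
W_enc = np.zeros((d, m))
W_enc[:, :d] = np.eye(d)
W_enc[:, d:2 * d] = -np.eye(d)
W_dec = np.zeros((m, d))
W_dec[:d] = np.eye(d)
W_dec[d:2 * d] = -np.eye(d)
b_enc = np.zeros(m)
b_dec = np.zeros(d)

loss_no_penalty, recon, sparsity = sae_loss(x, W_enc, b_enc, W_dec, b_dec, 0.0)
loss_with_penalty, _, _ = sae_loss(x, W_enc, b_enc, W_dec, b_dec, 1e-2)
```

With `l1_coeff = 0` the identity-like weights are already optimal, since the reconstruction error is zero. Once the penalty is switched on, the same weights pay a cost for every active hidden unit, so training is pushed towards solutions where few units fire per input. One caveat: in a sketch like this the penalty could also be dodged by shrinking the activations and scaling up the decoder weights, so practical setups usually constrain the decoder norm as well.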