Wait, why is it impossible for a fully-connected network trained by backpropagation to generalize across unseen input and output nodes? Is this supposed to be obvious?
For the full argument from Marcus, read the parts about “training independence” in The Algebraic Mind ch. 2, or in the paper it draws from, “Rethinking Eliminative Connectionism.”
The gist is really simple, though. First, note that if some input node is always zero during training, that’s equivalent to it not being there at all: its contribution to the input of any node in the first hidden layer is the relevant weight times zero, which is zero. Likewise, the gradient of anything w/r/t the weights attached to that node is zero (the chain rule always ends up multiplying by that zero input), so they’ll never get updated from their initial values.
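If that doesn’t feel concrete, here’s a minimal numpy sketch (the layer sizes, targets, and values are made up purely for illustration): the gradient column belonging to the always-zero input node comes out exactly zero, so no gradient-following update rule will ever touch those weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1-hidden-layer net: 4 inputs -> 3 hidden units (tanh) -> 1 output.
# Pretend input node 3 (the last one) is zero in every training example.
W1 = rng.normal(size=(3, 4)); b1 = rng.normal(size=3)
W2 = rng.normal(size=(1, 3)); b2 = rng.normal(size=1)

x = rng.normal(size=4)
x[3] = 0.0                        # the always-zero input node
y_target = 1.0

# Forward pass
h_in = W1 @ x + b1
h = np.tanh(h_in)
y = W2 @ h + b2
loss = 0.5 * (y - y_target) ** 2

# Backward pass, plain chain rule
dy = y - y_target                 # dL/dy
dW2 = np.outer(dy, h)
dh = W2.T @ dy
dh_in = dh * (1 - np.tanh(h_in) ** 2)
dW1 = np.outer(dh_in, x)          # dL/dW1[i, j] = dh_in[i] * x[j]

print(dW1[:, 3])                  # the column for input node 3: all zeros
```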
Then observe that, if the node is instead some nonzero constant during training, its connections add a constant to the first-hidden-layer inputs rather than zero. But we already have a parameter for an additive constant in a hidden-layer input: the “bias.” So if the input node is supposed to carry some information, the network still can’t learn what it is; anything it does learn about that node’s weights is indistinguishable from an adjustment to the bias. (Indeed, you can go the other way and rewrite the bias as an extra input node that’s always constant, or as N such nodes.)
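Another toy sketch, with a made-up constant c: folding the constant node’s contribution into the bias gives exactly the same hidden-layer input, which is why the network can’t tell the two apart.

```python
import numpy as np

rng = np.random.default_rng(1)
c = 2.5                                    # hypothetical constant value of input node 3

W1 = rng.normal(size=(3, 4)); b1 = rng.normal(size=3)
x = rng.normal(size=4); x[3] = c

# The layer with the constant node present...
h_full = W1 @ x + b1

# ...is indistinguishable from a 3-input layer whose bias has absorbed W1[:, 3] * c.
h_folded = W1[:, :3] @ x[:3] + (b1 + W1[:, 3] * c)

print(np.allclose(h_full, h_folded))       # True
```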
The argument for constant outputs is even simpler: the network will just set the weights and bias to something that always yields the right constant. For example, it’d work to set the incoming weights to zero and the bias to f⁻¹(c), where f is the output activation function and c is the constant. If the output actually has some relationship to the input, then this is wrong, but the training data plus the update rule give you no reason to reject it.
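Concretely, taking a sigmoid output unit just as an example of an invertible f (the hidden-layer size and the constant are made up): zero weights plus a bias of f⁻¹(c) reproduce the constant no matter what the rest of the network does.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):                          # the inverse of the sigmoid, i.e. f^-1
    return np.log(p / (1.0 - p))

c = 0.8                                # the constant this output node always took in training
w_out = np.zeros(5)                    # weights from the last hidden layer: all zero
b_out = logit(c)                       # bias = f^-1(c)

rng = np.random.default_rng(2)
for _ in range(3):
    h = rng.normal(size=5)             # whatever the hidden layer happens to produce
    print(sigmoid(w_out @ h + b_out))  # prints 0.8 every time, regardless of the input
```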
None of this is controversial, and it does indeed become obvious once you think about it enough; this kind of idea is much of the rationale for weight sharing, which determines the weights for constant input nodes from the same shared parameters learned on the non-constant ones, rather than leaving them at arbitrary initial values.
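For a rough picture of what weight sharing buys you, here’s a toy 1D convolution (the kernel values are made up): the same three weights handle every position, so a position that was silent throughout training still gets processed by weights that were actually learned elsewhere.

```python
import numpy as np

# One shared 3-tap kernel slides across every position, so the rightmost
# position is handled by the same weights that were shaped by all the other
# positions, even if it was zero in every training example.
kernel = np.array([0.2, 0.5, 0.3])     # hypothetical learned, shared weights

def conv1d(x, k):
    n, m = len(x), len(k)
    return np.array([x[i:i + m] @ k for i in range(n - m + 1)])

x_train_like = np.array([1., 2., 0., 3., 1., 0., 2., 0.])  # last node always 0
x_novel      = np.array([1., 2., 0., 3., 1., 0., 2., 5.])  # last node finally active

print(conv1d(x_train_like, kernel))
print(conv1d(x_novel, kernel))         # the final window now responds to the new value
```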