Yup, that’s roughly what I was picturing. (Really I was picturing a smooth approximation of that, but the conclusion is the same regardless.)
In general, the symmetry argument that GD “shouldn’t stray from the center of the surface” won’t hold in practice: either numerical noise knocks us off the line-where-both-are-equal, or the line-where-both-are-equal itself curves.
So, unless the line-where-both-are-equal is perfectly linear and the numerics are perfectly symmetric all the way to the lowest bits, GD will need to take steps of size ~epsilon to stay near the center of the surface.
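To make that concrete, here’s a toy sketch (my own construction, not anything from the model under discussion): a 2D loss surface whose line-where-both-are-equal, x == y, is an unstable ridge. In exact arithmetic, GD started exactly on the ridge stays there by symmetry; a perturbation on the order of one float64 epsilon grows exponentially instead.

```python
import numpy as np

# Toy loss: loss(x, y) = (x + y - 2)^2 - (x - y)^2.
# The direction along x == y is a minimum, but the transverse
# direction (x - y) is a local maximum, so the line-where-both-
# are-equal is an unstable ridge.
def grad(x, y):
    dx = 2 * (x + y - 2.0) - 2 * (x - y)
    dy = 2 * (x + y - 2.0) + 2 * (x - y)
    return dx, dy

x = y = 0.1
y += 1e-16          # one-bit-ish asymmetry, ~float64 machine epsilon
lr = 0.05
for step in range(200):
    dx, dy = grad(x, y)
    x -= lr * dx
    y -= lr * dy
    if step % 40 == 0:
        print(f"step {step:3d}  |x - y| = {abs(x - y):.3e}")
# |x - y| multiplies by (1 + 4 * lr) every step, so the ~1e-16
# perturbation reaches order 1 within a couple hundred steps.
```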
There should be a fair bit more than 2 epsilon of leeway in the line of equality. Since the submodules themselves are learned by SGD, they won’t be exactly equal, and the model will most likely include dropout as well. Thus, the signals sent to the combining function almost always differ by far more than the limits of numerical precision. This means the combining function will need quite a bit of leeway; otherwise the network’s performance will simply always be zero.
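A quick back-of-the-envelope sketch of that point (the noise scale and dropout rate below are made-up assumptions, just to get orders of magnitude): two submodule outputs that agree in expectation, with small per-module SGD noise plus independent dropout masks, differ from each other by vastly more than float32 machine epsilon.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Shared "true" signal plus independent per-submodule training noise
# (scale 1e-2 is an assumption) and independent dropout (p = 0.1,
# with the usual 1/keep rescaling).
base = rng.normal(size=n)
sub_a = (base + rng.normal(scale=1e-2, size=n)) * rng.binomial(1, 0.9, n) / 0.9
sub_b = (base + rng.normal(scale=1e-2, size=n)) * rng.binomial(1, 0.9, n) / 0.9

gap = np.abs(sub_a - sub_b).mean()
eps = np.finfo(np.float32).eps
print(f"mean |a - b| = {gap:.3e}")   # ~1e-1
print(f"float32 eps  = {eps:.3e}")   # ~1.2e-07
print(f"gap / eps    = {gap / eps:.1e}")
```

So if the combining function only tolerated an epsilon-scale mismatch, it would essentially never fire.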