On b_mag, it’s unclear what a “natural” choice would be for setting this parameter in order to simplify the architecture further. One natural reference point is to set it to e^{r_mag} ⊙ b_gate, but this corresponds to getting rid of the discontinuity in the Jump ReLU (turning the magnitude encoder into a ReLU on multiplicatively rescaled gate encoder preactivations). Effectively (removing the now unnecessary auxiliary task), this would give results similar to the “baseline + rescale & shift” benchmark in section 5.2 of the paper, although probably worse, as we wouldn’t have the shift.
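To make the collapse concrete, here is a minimal sketch in plain Python. It assumes the Gated SAE parameterisation in which the magnitude weights are tied to the gate weights via W_mag = e^{r_mag} ⊙ W_gate. The function names, shapes, and random test values are hypothetical and only for illustration. With b_mag = e^{r_mag} ⊙ b_gate, the magnitude preactivation is just a positive rescaling of the gate preactivation, so the gate is redundant and the encoder reduces to an ordinary ReLU.

```python
import math
import random


def centered_dots(x, W_gate, b_dec):
    """W_gate @ (x - b_dec), one dot product per latent."""
    xc = [xi - bi for xi, bi in zip(x, b_dec)]
    return [sum(w * v for w, v in zip(row, xc)) for row in W_gate]


def gated_encoder(x, W_gate, b_gate, r_mag, b_mag, b_dec):
    """Gated encoder: Heaviside gate times ReLU of the magnitude path."""
    out = []
    for dot, bg, r, bm in zip(centered_dots(x, W_gate, b_dec), b_gate, r_mag, b_mag):
        pi_gate = dot + bg
        pi_mag = math.exp(r) * dot + bm  # assumes W_mag = exp(r_mag) * W_gate
        gate = 1.0 if pi_gate > 0 else 0.0
        out.append(gate * max(pi_mag, 0.0))
    return out


def rescaled_relu_encoder(x, W_gate, b_gate, r_mag, b_dec):
    """Plain ReLU on multiplicatively rescaled gate preactivations."""
    dots = centered_dots(x, W_gate, b_dec)
    return [max(math.exp(r) * (dot + bg), 0.0) for dot, bg, r in zip(dots, b_gate, r_mag)]


def max_gap(seed=0, trials=200, d_in=5, d_sae=8):
    """Largest difference between the two encoders when b_mag = exp(r_mag) * b_gate."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(trials):
        W_gate = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_sae)]
        b_gate = [rng.gauss(0, 1) for _ in range(d_sae)]
        r_mag = [rng.gauss(0, 1) for _ in range(d_sae)]
        b_dec = [rng.gauss(0, 1) for _ in range(d_in)]
        x = [rng.gauss(0, 1) for _ in range(d_in)]
        # The choice discussed above: tie the magnitude bias to the gate bias.
        b_mag = [math.exp(r) * bg for r, bg in zip(r_mag, b_gate)]
        gated = gated_encoder(x, W_gate, b_gate, r_mag, b_mag, b_dec)
        plain = rescaled_relu_encoder(x, W_gate, b_gate, r_mag, b_dec)
        gap = max(gap, max(abs(a - b) for a, b in zip(gated, plain)))
    return gap


if __name__ == "__main__":
    print("max gap:", max_gap())
```

Any remaining gap should be at the level of floating-point rounding. For any other b_mag, the two encoders generally differ, which is where the jump at the gate threshold comes from.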
Makes sense that the shift would be helpful