Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.
@Lucius Bushnaq, why would MLPs compute linear transformations?
Since two linear transformations compose into a single linear transformation, why wouldn’t the downstream MLPs/attention layers that rely on this linearly transformed vector just learn the combined function?
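To make the premise concrete, here is a minimal sketch of the composition fact the question relies on. The matrices `w_up` and `w_down` are made-up stand-ins (not taken from any real model) for a linear map written by an upstream layer and the linear read-in of a downstream layer; the sketch ignores biases, nonlinearities, and the residual stream.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply(w, x):
    """Apply matrix w (list of rows) to vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


# Hypothetical weights, chosen only for illustration.
w_up = [[1, 2], [0, 1], [3, -1]]    # upstream linear map, 2 -> 3 dims
w_down = [[2, 0, 1], [-1, 1, 0]]    # downstream linear read-in, 3 -> 2 dims
x = [4, -2]

# Applying the two maps in sequence...
two_step = apply(w_down, apply(w_up, x))

# ...matches applying the single precomposed matrix w_down @ w_up.
w_combined = matmul(w_down, w_up)
one_step = apply(w_combined, x)

print(two_step, one_step)
```

This only shows that the downstream layer could in principle absorb the upstream linear map into its own weights; it does not by itself say whether training would actually find that solution.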