I’m fairly sure there are architectures where each layer is a linear function of the concatenated activations of all previous layers, though I can’t seem to find one right now. If you additionally allow those connections to be sparse, then I think you get a fully general DAG.
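A minimal numpy sketch of the idea (DenseNet is one published architecture with this kind of every-layer-to-every-later-layer connectivity, though there the layers aren't purely linear). The `masks` argument here is my own illustrative addition: a 0/1 flag per predecessor per layer, so choosing the masks freely carves an arbitrary DAG of linear dependencies out of the fully dense stack.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_stack(x, layer_sizes, masks=None):
    """Each layer j is a linear map of the concatenation of the
    activations of ALL previous layers (including the input).
    masks[j][k] in {0, 1} says whether layer j reads layer k;
    masks=None means fully dense connectivity."""
    acts = [x]  # acts[0] is the input; acts[j] is layer j's output
    for j, size in enumerate(layer_sizes):
        inputs = np.concatenate(acts)  # all previous activations
        if masks is not None:
            # expand the per-predecessor 0/1 flags to input width
            keep = np.concatenate([
                np.full(a.shape[0], masks[j][k])
                for k, a in enumerate(acts)
            ])
            inputs = inputs * keep
        w = rng.standard_normal((size, inputs.shape[0]))
        acts.append(w @ inputs)
    return acts

# Fully dense: every layer sees every predecessor.
dense = dense_stack(np.ones(4), [3, 3, 2])

# Sparse: layer 2 ignores layer 1, giving the DAG
# input -> L1, input -> L2, {input, L1, L2} -> L3.
masks = [[1], [1, 0], [1, 1, 1]]
sparse = dense_stack(np.ones(4), [3, 3, 2], masks)
print([a.shape for a in sparse])  # [(4,), (3,), (3,), (2,)]
```

With all masks set to 1 this is the fully dense case; zeroing entries recovers an ordinary chain, skip connections, or any other DAG over the layers.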