Standard feedforward DNNs encompass “circuits buildable from matmul and ReLU”; crucially, however, the backprop gradient update necessarily includes another key operator: the matrix transpose.
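To make that concrete, here is a minimal NumPy sketch (mine, not from the original) of a single linear + ReLU layer: the forward pass needs only matmul and ReLU, while the backward pass pulls in the transpose of the weight matrix. Shapes and the upstream gradient are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))      # weights: map an 8-dim input to a 4-dim output
x = rng.normal(size=(8,))        # input vector

# Forward pass: matmul + ReLU only.
z = W @ x
h = np.maximum(z, 0.0)           # ReLU

# Backward pass: propagating dL/dh back to the input requires W.T.
dL_dh = rng.normal(size=(4,))    # upstream gradient (placeholder)
dL_dz = dL_dh * (z > 0)          # ReLU gate
dL_dx = W.T @ dL_dz              # <-- the transpose shows up here
dL_dW = np.outer(dL_dz, x)       # weight gradient (outer product)
```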
Transformers are a semi-special case of attention/memory-augmented networks, which encompass “circuits buildable from matmul, ReLU, and transpose”. They thus incorporate dynamic multiplicative interactions, which push (at least) the ability to learn, or quickly memorize, into the forward pass.
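As an illustration (again mine, not the author's), a bare single-head attention computation shows where the transpose lands at inference time: Q @ K.T multiplies two activation matrices together, so the mixing weights applied to V are themselves computed on the fly from the input rather than being fixed learned parameters. Sequence length and width are arbitrary.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 5, 16                              # sequence length, model width (arbitrary)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)             # <-- transpose of an activation matrix, in the forward pass
out = softmax(scores) @ V                 # input-dependent mixing weights act like fast, temporary weights
```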
So yes, adding that transpose into the forward/inference pass greatly expands the space of circuits you can efficiently emulate. It’s not obvious how many more such fundamental ops one needs for AGI. Brain circuits don’t obviously have many key functional components beyond matmul, ReLU, multiplicative gating interactions, and efficient sparsity. (Brains also feature other oddities, like multiplicative/exponential rather than linear updates and various related non-negative constraints, but it’s unclear how important those are.)
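For concreteness, a GLU-style gate is one common way such multiplicative gating interactions show up in artificial networks; the sketch below is illustrative only, with arbitrary names and sizes, and is not claimed to be how brains implement gating.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d_in, d_out = 16, 8
W_value = rng.normal(size=(d_out, d_in))
W_gate = rng.normal(size=(d_out, d_in))
x = rng.normal(size=(d_in,))

# One activation pathway multiplicatively modulates another, rather than the
# two simply being summed and passed through a fixed pointwise nonlinearity.
y = (W_value @ x) * sigmoid(W_gate @ x)
```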