(To me, one important aspect is whether computation is fundamentally limited to a fixed number of steps vs. having a potentially unbounded loop.
The autoregressive version is an interesting compromise: it's a fixed number of steps per token, but the answer can unfold in an unbounded fashion.
An interesting tidbit here is that for traditional RNNs it is one loop iteration per input token, whereas in autoregressive Transformers it is one loop iteration per output token; the sketch below makes this concrete.)
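A minimal Python sketch of where the loop sits in each case; rnn_step, model, and eos are hypothetical stand-ins for illustration, not any particular library's API:

    def rnn_consume(tokens, rnn_step, h0):
        """Traditional RNN: one loop iteration per INPUT token.
        The loop bound is fixed by the input length."""
        h = h0
        for tok in tokens:          # iterations = len(input)
            h = rnn_step(h, tok)
        return h

    def autoregressive_generate(prompt, model, eos):
        """Autoregressive Transformer: one loop iteration per OUTPUT token.
        Each step is a fixed-depth forward pass, but the loop itself is
        potentially unbounded: it runs until the model emits eos."""
        seq = list(prompt)
        while True:                 # iterations = len(output), unknown in advance
            nxt = model(seq)        # fixed number of steps per token
            seq.append(nxt)
            if nxt == eos:          # the model decides when to stop
                return seq

    # Toy usage with stand-in functions:
    out = autoregressive_generate([1, 2], lambda seq: 0 if len(seq) > 4 else len(seq), eos=0)

The key contrast is which side fixes the loop count: the RNN's loop is bounded by the input it is given, while the autoregressive loop's length is an output of the computation itself.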
“Hooked up to something” might make a difference.