I think the state is encoded in the activations. There is a paper showing that although Transformers are feed-forward transducers, in autoregressive mode they emulate RNNs:
Section 3.4 of “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention”, https://arxiv.org/abs/2006.16236
So the current set of activations encodes the hidden state of that “virtual RNN”.
This might be relevant to some of the discussion threads here...
Thanks.