Briefly read a ChatGPT description of Transformer-XL: is this essentially long-term memory? Are there computations an LSTM could do that a Transformer-XL couldn't?
There is still technically a limit to how far back a Transformer-XL can see, since each layer can only attend to the keys/values that the same layer computed over previous segments. As a result, the receptive field of layer L extends only across the last L context windows. This suggests there may be some computations an LSTM can do that a Transformer-XL can't, but the gap can be closed with a couple of minor modifications. For example, this paper addresses the problem by letting layers attend to the outputs of later layers from previous context windows, which makes the receptive field (at least in theory) unbounded, so the modified model should be able to do everything an LSTM can.
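To make the mechanism concrete, here is a minimal PyTorch sketch of segment-level recurrence (not the actual Transformer-XL code): a single attention layer whose queries come from the current segment, while its keys/values also cover a detached cache of the hidden states the layer saw in earlier segments. Relative positional encodings, causal masking, and the rest of the block (feed-forward, layer norm) are omitted, and the class and parameter names (`SegmentRecurrentAttention`, `mem_len`) are just illustrative.

```python
import torch
import torch.nn as nn
from typing import Optional


class SegmentRecurrentAttention(nn.Module):
    """Single attention layer with Transformer-XL-style segment memory.

    Queries come from the current segment only; keys/values cover
    [cached hidden states from earlier segments; current segment].
    Relative positional encodings and causal masking are omitted for brevity.
    """

    def __init__(self, d_model: int, n_heads: int, mem_len: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_len = mem_len

    def forward(self, x: torch.Tensor, memory: Optional[torch.Tensor]):
        # x:      (batch, seg_len, d_model) hidden states of the current segment
        # memory: (batch, <=mem_len, d_model) hidden states this layer saw in
        #         earlier segments, detached so no gradient crosses segments
        context = x if memory is None else torch.cat([memory, x], dim=1)
        out, _ = self.attn(query=x, key=context, value=context, need_weights=False)
        # Cache the most recent mem_len hidden states for the next segment.
        new_memory = context[:, -self.mem_len:].detach()
        return out, new_memory


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = SegmentRecurrentAttention(d_model=32, n_heads=4, mem_len=8)
    memory = None
    # Feed a long sequence one 8-token segment at a time; each segment can
    # attend to the cached states of the previous segment(s) at this layer.
    for segment in torch.randn(4, 3, 8, 32).unbind(dim=1):
        out, memory = layer(segment, memory)
    print(out.shape, memory.shape)  # (4, 8, 32) and (4, 8, 32)
```

Because the cache is detached and holds only states from the layer below, each layer can reach back at most one cached segment further than the layer beneath it, which is exactly why stacking L such layers gives roughly L context windows of effective history rather than an unbounded window.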