Er, maybe your notation is obscuring this for me, but how does that follow? Where is the RNN getting this special randomness from? Why aren’t the internal activations of a many-layer Transformer perfectly adequate to first encode, ‘storing z’, and then transform?
I’m assuming that either architecture can use a source of random bits.
The transformer produces one bit at a time, computing every bit from the history so far. It doesn’t have any state except for the history. At some stage of the game the history consists of y only. At this stage the transformer would have to compute z from y in order to win. It doesn’t have any activations to go on besides those that can be produced from y.
And the Transformer can recompute whatever function the RNN is computing over its history, no, as I said? Whatever a RNN can do with its potentially limited access to history, a Transformer can recompute with its full access to history as if it were the unrolled RNN. It can recompute that for every bit, generate the next one, and then recompute on the next step with that as the newest part of its history being conditioned on.
No, because the RNN is not deterministic. In order to simulate the RNN, the transformer would have to do exponentially many “Monte Carlo” iterations until it produces the right history.
An RNN is deterministic, usually (how else are you going to backprop through it to train it? not too easily), and even if it’s not, I don’t see why that would make a difference, or why a Transformer couldn’t be ‘not deterministic’ in the same sense given access to random bits (talking about stochastic units merely smuggles in bits by the back door) nor why it can’t learn ‘Monte Carlo iterations’ internally (say, one per head).
I already conceded a Transformer can be made stochastic. I don’t see a problem with backproping: you treat the random inputs as part of the environment, and there’s no issue with the environment having stochastic parts. It’s stochastic gradient descent, after all.
Because you don’t train the inputs, you’re trying to train parameters, but the gradients stop cold there if you just treat them as blackboxes, and this seems like it’s abusing the term ‘stochastic’ (what does the size of minibatches being smaller than the full dataset have to do with this?). I still don’t understand what you think Transformers are doing differently vs RNNs in terms of what kind of processing of history they are doing and why Transformers can’t meta-learn in the same way as RNNs internally.
I am not sure what do you mean by “stop cold?” It has to with minibatches, because in offline learning your datapoints can (and usually are) regarded as sampled from some IID process, and here we also have a stochastic environment (but not IID). I dont see anything unusual about this, the MDP in RL is virtually always allowed to be stochastic.
As to the other thing, I already conceded that transformers are no worse than RNNs in this sense, so you seem to be barging into an open door here?
Er, maybe your notation is obscuring this for me, but how does that follow? Where is the RNN getting this special randomness from? Why aren’t the internal activations of a many-layer Transformer perfectly adequate to first encode, ‘storing z’, and then transform?
I’m assuming that either architecture can use a source of random bits.
The transformer produces one bit at a time, computing every bit from the history so far. It doesn’t have any state except for the history. At some stage of the game the history consists of y only. At this stage the transformer would have to compute z from y in order to win. It doesn’t have any activations to go on besides those that can be produced from y.
And the Transformer can recompute whatever function the RNN is computing over its history, no, as I said? Whatever a RNN can do with its potentially limited access to history, a Transformer can recompute with its full access to history as if it were the unrolled RNN. It can recompute that for every bit, generate the next one, and then recompute on the next step with that as the newest part of its history being conditioned on.
No, because the RNN is not deterministic. In order to simulate the RNN, the transformer would have to do exponentially many “Monte Carlo” iterations until it produces the right history.
An RNN is deterministic, usually (how else are you going to backprop through it to train it? not too easily), and even if it’s not, I don’t see why that would make a difference, or why a Transformer couldn’t be ‘not deterministic’ in the same sense given access to random bits (talking about stochastic units merely smuggles in bits by the back door) nor why it can’t learn ‘Monte Carlo iterations’ internally (say, one per head).
I already conceded a Transformer can be made stochastic. I don’t see a problem with backproping: you treat the random inputs as part of the environment, and there’s no issue with the environment having stochastic parts. It’s stochastic gradient descent, after all.
Because you don’t train the inputs, you’re trying to train parameters, but the gradients stop cold there if you just treat them as blackboxes, and this seems like it’s abusing the term ‘stochastic’ (what does the size of minibatches being smaller than the full dataset have to do with this?). I still don’t understand what you think Transformers are doing differently vs RNNs in terms of what kind of processing of history they are doing and why Transformers can’t meta-learn in the same way as RNNs internally.
I am not sure what do you mean by “stop cold?” It has to with minibatches, because in offline learning your datapoints can (and usually are) regarded as sampled from some IID process, and here we also have a stochastic environment (but not IID). I dont see anything unusual about this, the MDP in RL is virtually always allowed to be stochastic.
As to the other thing, I already conceded that transformers are no worse than RNNs in this sense, so you seem to be barging into an open door here?