But if GPT-3 can accomplish the same things empirically, who cares? GPT-3 is entirely reconstructing the “learned information” from the history, at every step. If it can accomplish so much this way, should we count its lack of recurrence against it?
I think that’s exactly it. There’s no real difference between a history, and a recurrence. A recurrence is a (lossy) function of a history, so anything a recurrent hidden state can encode, a sufficiently large/deep feedforward model given access to the full history should be able to internally represent as well.
GPT with a context window of 1 token would be unable to do any kind of meta-learning, in much the same way that a RNN with no hidden state (or at its first step with a default hidden state) working one step at a time would be unable to do anything. Whether you compute your meta-learning ‘horizontally’ by repeated application to a hidden state, updating token by token, or ‘vertically’ inside a deep Transformer (an unrolled RNN?) conditioned on the entire history, makes no difference aside from issues of perhaps computational efficiency (a RNN is probably faster to run but slower to train) and needing more or less parameters or layers to achieve the same effective amount of pondering time (although see Universal Transformers there).
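To make the equivalence concrete, here is a minimal sketch in plain Python (the toy `rnn_step` and hash-based update are mine, purely illustrative): the hidden state is nothing but a function of the prefix, so a model handed the whole prefix already has everything the state contains.

```python
def rnn_step(hidden, token):
    # Stand-in for a learned recurrent update; any deterministic map works here.
    return hash((hidden, token)) & 0xFFFFFFFF

def hidden_from_history(history, h0=0):
    # 'Horizontal' computation: repeated application of the update, token by token.
    h = h0
    for tok in history:
        h = rnn_step(h, tok)
    return h

# 'Vertical' computation: a deep model conditioned on the entire history receives
# the same prefix as input, so the quantity above is already determined by what
# it sees -- no hidden state needs to be carried between steps.
print(hidden_from_history(["the", "cat", "sat"]))
```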
There’s no real difference between a history, and a recurrence.
That’s true for unbounded agents but false for realistic (bounded) agents. Consider the following two-player zero-sum game:
Player A secretly writes some x ∈ {0,1}^n, then player B says some y ∈ {0,1}^n, and finally player B says some z ∈ {0,1}^n. Player A gets reward 1 unless y = f(z), where f: {0,1}^n → {0,1}^n is a fixed one-way function. If y = f(z), player A gets a reward in [0,1] which is the fraction of bits x and z have in common.
The optimal strategy for player A is producing a random sequence. The optimal strategy for player B is choosing a random z, computing y:=f(z), outputting y and then outputting z. The latter is something that an RNN can implement (by storing z in its internal state) but a stateless architecture like a transformer cannot implement. A stateless algorithm would have to recover z from y, but that is computationally unfeasible.
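Here is a sketch of the game in Python, with SHA-256 standing in for the one-way function f and byte-strings in place of bit-strings (both substitutions are mine, purely illustrative):

```python
import hashlib, secrets

N = 32  # length of x and z, in bytes

def one_way(z: bytes) -> bytes:
    # Stand-in for f: {0,1}^n -> {0,1}^n, assumed hard to invert.
    return hashlib.sha256(z).digest()

def player_b_with_state():
    """Player B as a stateful (RNN-like) agent: pick z, emit y = f(z), remember z."""
    z = secrets.token_bytes(N)     # held in the agent's internal state
    y = one_way(z)
    return y, z                    # emits y first, then the remembered z

def player_a_reward(x: bytes, y: bytes, z: bytes) -> float:
    if y != one_way(z):
        return 1.0
    agreeing = sum(bin(xb ^ zb ^ 0xFF).count("1") for xb, zb in zip(x, z))
    return agreeing / (8 * N)      # fraction of bits x and z have in common

x = secrets.token_bytes(N)         # player A's secret random string
y, z = player_b_with_state()
print(player_a_reward(x, y, z))    # about 0.5 in expectation

# A stateless player B, by contrast, sees only the history, which at the moment
# of emitting z is just y -- so it would have to invert one_way, which is
# computationally infeasible.
```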
On second thought, that’s not a big deal: we can fix it by interspersing random bits in the input. This way, the transformer would see a history that includes y and the random bits used to produce it (which encode z). More generally, such a setup can simulate any randomized RNN.
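A small continuation of the sketch above (reusing the illustrative one_way), showing the fix: once the random bits are part of the visible history, the stateless player reads z straight off the history and never needs to invert f.

```python
import hashlib, secrets

def one_way(z: bytes) -> bytes:
    return hashlib.sha256(z).digest()

def play_stateless(N=32):
    # The environment intersperses the random bits into the input history.
    r = secrets.token_bytes(N)
    history = [("random_bits", r)]

    # Emit y, computed purely from the visible history (which contains r).
    y = one_way(r)
    history.append(("y", y))

    # Emit z, again purely from the visible history: the bits that produced y
    # are right there, so no inversion of one_way is needed.
    z = history[0][1]
    history.append(("z", z))
    return y, z

y, z = play_stateless()
assert y == one_way(z)
```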
Er, maybe your notation is obscuring this for me, but how does that follow? Where is the RNN getting this special randomness from? Why aren’t the internal activations of a many-layer Transformer perfectly adequate to first encode, ‘storing z’, and then transform?
I’m assuming that either architecture can use a source of random bits.
The transformer produces one bit at a time, computing every bit from the history so far. It doesn’t have any state except for the history. At some stage of the game the history consists of y only. At this stage the transformer would have to compute z from y in order to win. It doesn’t have any activations to go on besides those that can be produced from y.
And the Transformer can recompute whatever function the RNN is computing over its history, no, as I said? Whatever a RNN can do with its potentially limited access to history, a Transformer can recompute with its full access to history as if it were the unrolled RNN. It can recompute that for every bit, generate the next one, and then recompute on the next step with that as the newest part of its history being conditioned on.
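A minimal sketch of this claim, again in plain Python with toy functions of my own choosing: a stateless generator that replays the whole history from scratch at every step produces exactly the same outputs as the stateful RNN it unrolls.

```python
def rnn_step(hidden, token):
    return hash((hidden, token)) & 0xFFFFFFFF

def emit(hidden):
    return hidden & 1                      # next output bit, read off the state

def generate_stateful(steps, h0=0):
    h, out = h0, []
    for _ in range(steps):
        out.append(emit(h))
        h = rnn_step(h, out[-1])           # carry the state forward
    return out

def generate_stateless(steps, h0=0):
    out = []
    for _ in range(steps):
        h = h0
        for bit in out:                    # recompute the 'state' from the full
            h = rnn_step(h, bit)           # history, from scratch, every step
        out.append(emit(h))
    return out

assert generate_stateful(16) == generate_stateless(16)
```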
No, because the RNN is not deterministic. In order to simulate the RNN, the transformer would have to do exponentially many “Monte Carlo” iterations until it produces the right history.
An RNN is usually deterministic (how else are you going to backprop through it to train it? not too easily), and even if it’s not, I don’t see why that would make a difference, why a Transformer couldn’t be ‘not deterministic’ in the same sense given access to random bits (talking about stochastic units merely smuggles in bits by the back door), or why it can’t learn ‘Monte Carlo iterations’ internally (say, one per head).
I already conceded a Transformer can be made stochastic. I don’t see a problem with backprop: you treat the random inputs as part of the environment, and there’s no issue with the environment having stochastic parts. It’s stochastic gradient descent, after all.
Because you don’t train the inputs: you’re trying to train parameters, and the gradients stop cold at the inputs if you just treat them as black boxes; and this seems like it’s abusing the term ‘stochastic’ (what does the size of minibatches being smaller than the full dataset have to do with this?). I still don’t understand what you think Transformers are doing differently vs RNNs in terms of what kind of processing of history they are doing, and why Transformers can’t meta-learn in the same way as RNNs internally.
I am not sure what you mean by "stop cold". It has to do with minibatches because in offline learning your datapoints can be (and usually are) regarded as sampled from some IID process, and here we also have a stochastic environment (but not IID). I don’t see anything unusual about this; the MDP in RL is virtually always allowed to be stochastic.
As to the other thing, I already conceded that transformers are no worse than RNNs in this sense, so you seem to be barging into an open door here?
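On the backprop point a few exchanges up, a minimal PyTorch sketch (toy shapes and loss, purely illustrative) of the situation both sides are describing: no gradient flows into the random inputs, and none needs to, while the gradients for the parameters are perfectly well-defined.

```python
import torch

w = torch.randn(8, requires_grad=True)       # trainable parameters
noise = torch.randint(0, 2, (8,)).float()    # random input bits: part of the
                                             # environment, not trained

loss = ((w * noise).sum() - 1.0) ** 2        # toy loss depending on both
loss.backward()

print(w.grad)       # well-defined gradients for the parameters
print(noise.grad)   # None: gradients 'stop' at the random input, which is fine
```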
Practically speaking, I think the big difference is that the history is outside of GPT-3’s control, but a recurrent memory would be inside its control.