I am not sure what you mean by “stop cold.” It has to do with minibatches: in offline learning your datapoints can be (and usually are) regarded as sampled from some IID process, whereas here we also have a stochastic environment (but not an IID one). I don't see anything unusual about this; the MDP in RL is virtually always allowed to be stochastic.
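To make the distinction concrete, here is a minimal sketch (plain NumPy, with a made-up two-state MDP and a made-up dataset, purely for illustration) contrasting IID minibatch sampling from a fixed offline dataset with transitions collected sequentially from a stochastic MDP, where successive samples are correlated through the state:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Offline / supervised setting: minibatches drawn IID from a fixed dataset.
dataset = rng.normal(size=(10_000, 4))  # stand-in feature vectors
minibatch = dataset[rng.integers(0, len(dataset), size=32)]  # each draw independent of the last

# --- RL setting: a hypothetical two-state stochastic MDP.
# Transition probabilities depend on the current state, so consecutive
# samples are correlated -- stochastic, but not IID.
P = {0: [0.9, 0.1],   # from state 0: stay with prob 0.9, move with prob 0.1
     1: [0.2, 0.8]}   # from state 1: move back with prob 0.2, stay with prob 0.8

state = 0
trajectory = []
for _ in range(32):
    next_state = rng.choice(2, p=P[state])
    trajectory.append((state, next_state))
    state = next_state  # the next sample depends on where we ended up
```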
As to the other thing, I already conceded that transformers are no worse than RNNs in this sense, so you seem to be pushing at an open door here?