No, because the RNN is not deterministic. In order to simulate the RNN, the transformer would have to do exponentially many “Monte Carlo” iterations until it produces the right history.
An RNN is deterministic, usually (how else are you going to backprop through it to train it? not too easily), and even if it’s not, I don’t see why that would make a difference, or why a Transformer couldn’t be ‘not deterministic’ in the same sense given access to random bits (talking about stochastic units merely smuggles in bits by the back door), nor why it couldn’t learn ‘Monte Carlo iterations’ internally (say, one per head).
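A minimal PyTorch sketch of what ‘random bits, one Monte Carlo iteration per head’ could look like, purely illustrative (the `MonteCarloHeads` name and the linear ‘heads’ standing in for attention are assumptions, not anything from the thread): each toy head receives its own noise bits alongside the input, the head outputs are averaged like Monte Carlo samples, and gradients still reach every parameter.

```python
import torch
import torch.nn as nn

class MonteCarloHeads(nn.Module):
    """Toy layer: every 'head' sees the same input plus its own fresh
    random bits, and the head outputs are averaged -- one crude
    'Monte Carlo iteration' per head. Illustrative only; real attention
    heads are not wired this way."""

    def __init__(self, d_in: int, d_out: int, n_heads: int = 4, n_bits: int = 2):
        super().__init__()
        self.n_bits = n_bits
        self.heads = nn.ModuleList(
            [nn.Linear(d_in + n_bits, d_out) for _ in range(n_heads)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = []
        for head in self.heads:
            # Noise enters as an *input*, so it needs no gradient of its own.
            bits = torch.randint(0, 2, (*x.shape[:-1], self.n_bits)).float()
            outs.append(head(torch.cat([x, bits], dim=-1)))
        return torch.stack(outs).mean(dim=0)

layer = MonteCarloHeads(d_in=8, d_out=8)
layer(torch.randn(5, 8)).sum().backward()  # gradients reach every head
```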
I already conceded a Transformer can be made stochastic. I don’t see a problem with backpropping: you treat the random inputs as part of the environment, and there’s no issue with the environment having stochastic parts. It’s stochastic gradient descent, after all.
Because you don’t train the inputs; you’re trying to train the parameters, and the gradients stop cold at the inputs if you just treat them as black boxes. This also seems like it’s abusing the term ‘stochastic’ (what does the size of minibatches being smaller than the full dataset have to do with this?). I still don’t understand what you think Transformers are doing differently vs RNNs in terms of what kind of processing of history they are doing, and why Transformers can’t meta-learn internally in the same way RNNs can.
I am not sure what you mean by “stop cold.” It has to do with minibatches because, in offline learning, your datapoints can be (and usually are) regarded as sampled from some IID process, and here we likewise have a stochastic environment (just not an IID one). I don’t see anything unusual about this; the MDP in RL is virtually always allowed to be stochastic.
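The “stop cold” disagreement is concrete enough to show in a few lines of PyTorch (illustrative code, not from the thread): when randomness enters as an input, parameter gradients are unaffected; when a non-differentiable sample sits inside the network, autograd really does cut the graph at that point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)  # stand-in for an RNN/Transformer cell
x = torch.randn(2, 3)

# Case 1: random bits as an *input* -- "part of the environment".
# The bits need no gradient; parameter gradients flow as usual.
bits = torch.randint(0, 2, (2, 1)).float()
model(torch.cat([x, bits], dim=1)).pow(2).sum().backward()
print(model.weight.grad)  # well-defined gradients w.r.t. parameters

# Case 2: a discrete sample *inside* the network. torch.bernoulli
# is non-differentiable, so autograd cuts the graph at the sample --
# this is the sense in which gradients "stop cold".
probs = torch.sigmoid(model(torch.cat([x, bits], dim=1)))
sample = torch.bernoulli(probs)
print(sample.grad_fn)  # None: no gradient path back to the parameters
```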
As to the other thing, I already conceded that transformers are no worse than RNNs in this sense, so you seem to be pushing at an open door here?