Great post! I was wondering if the conclusion to be drawn is really that "dropout inhibits superposition"? My prior was that it should increase it (so this post proved me wrong on that part), mainly because in a model with MLPs in parallel (like a transformer), dropout would force redundancy of circuits, not inside one MLP, but across different MLPs.
I'd like to see more on that; it would be super useful to know whether or not dropout helps interpretability, to decide whether to enforce it during training.
Thanks!
I haven’t yet explored the case you describe with transformers. In general I suspect the effect will be more complex for multilayer transformers than it is for simple autoencoders. However, in the simple case explored so far, it seems this conclusion can be drawn. Dropout essentially adds variability to the direction in which each feature is represented, so features can’t be packed as tightly.
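A minimal sketch of that intuition (hypothetical toy code, not the setup from the post): a linear map into a small hidden space with dropout applied to the hidden layer, showing that the hidden direction used for one feature changes with the dropout mask.

```python
# Hypothetical toy setup (not the post's code): a linear map into a small
# hidden space with dropout, to show that the hidden direction used for a
# single feature varies with the dropout mask.
import torch

torch.manual_seed(0)
n_features, n_hidden = 5, 2
W = torch.randn(n_hidden, n_features)   # assumed encoder weights
dropout = torch.nn.Dropout(p=0.5)       # dropout on the hidden layer (training mode)

x = torch.zeros(n_features)
x[0] = 1.0                              # a single active feature

for _ in range(3):
    h = dropout(W @ x)                  # hidden activations under one dropout mask
    print(h / (h.norm() + 1e-8))        # the direction this feature occupies this pass
```

Each pass prints a different (normalized) direction for the same feature, which is the variability that keeps features from being packed as tightly.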
One thing I just thought of: I would predict that dropout reduces superposition in parallel and increases superposition in series (because, to make sure the function gets computed, you can have redundancy).
What do you mean exactly by ‘in parallel’ and ‘in series’?
In an MLP, the nodes from different layers are in series (you need to go through the first and then the second), but inside the same layer they are in parallel (you go through one or the other).
The analogy is with electrical circuits, but I was mostly thinking in terms of LLM components: the MLPs and attention blocks are in series (you go through the first and then through the second), but inside one component the units are in parallel (see the sketch at the end of this comment).
I guess that then, inside a component there is less superposition (this post is evidence for that), and between components there is redundancy (so if a computation fails somewhere, it is also done somewhere else).
In general, dropout makes me feel like, because some parts of the network are not going to work, the network has to implement “independent” components in order to compute things properly.
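Here is a minimal sketch of the series/parallel picture above (hypothetical code, assuming a standard residual transformer block, not the architecture from the post): attention and the MLP are in series along the residual stream, while the hidden neurons inside the MLP act in parallel on the same input.

```python
# Hypothetical sketch of series vs. parallel in a transformer block
# (standard residual block, not the toy model from the post).
import torch
import torch.nn as nn

d_model, d_hidden, n_heads = 16, 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
mlp_in = nn.Linear(d_model, d_hidden)
mlp_out = nn.Linear(d_hidden, d_model)

def block(x):
    # In series: the MLP only sees the residual stream *after* attention wrote to it.
    a, _ = attn(x, x, x)
    x = x + a
    # In parallel: each of the d_hidden neurons computes its own relu(w_i · x + b_i)
    # independently of the others; dropping one doesn't route through another.
    h = torch.relu(mlp_in(x))
    return x + mlp_out(h)

x = torch.randn(1, 10, d_model)  # (batch, seq, d_model)
print(block(x).shape)            # torch.Size([1, 10, 16])
```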