Thanks!
I haven’t yet explored the case you describe with transformers. In general I suspect the effect will be more complex in multilayer transformers than it is for simple autoencoders. However, in the simple case explored so far, this conclusion does seem to hold: dropout essentially adds variability to the direction in which each feature is represented, so features can’t be packed as tightly.
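A minimal sketch of the mechanism (my own toy setup, not the post’s actual code): in a tiny linear autoencoder with dropout on the hidden layer, the hidden-space direction that a single active feature maps to jitters from one forward pass to the next, because a different random subset of units survives each time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear autoencoder: 5 features squeezed into 3 hidden units.
W = rng.normal(size=(3, 5))  # encoder weights (decoder would be W.T)

def hidden_with_dropout(x, p=0.5):
    # Inverted dropout on the hidden layer: zero each unit with prob p,
    # rescale survivors by 1/(1-p) to preserve the expected activation.
    mask = (rng.random(W.shape[0]) > p) / (1.0 - p)
    return (W @ x) * mask

x = np.eye(5)[0]  # a single active feature

# Across forward passes, the hidden direction representing feature 0 varies:
samples = np.stack([hidden_with_dropout(x) for _ in range(100)])
per_unit_std = samples.std(axis=0)
print(per_unit_std)  # nonzero: dropout adds variance along the feature's direction
```

Intuitively, this pass-to-pass variance is what prevents features from being packed at tight angles: nearby features would interfere whenever their overlapping units are dropped.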
One thing I just thought about: I would predict that dropout reduces superposition in parallel and increases superposition in series (because, to make sure the function is still computed, you can have redundancy).
What do you mean exactly by ‘in parallel’ and ‘in series’?
In an MLP, the nodes of different layers are in series (you have to go through the first, then the second), while nodes within the same layer are in parallel (you go through one or the other).
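To make the series/parallel picture concrete, here is a hypothetical two-layer MLP forward pass (my own illustration, with made-up shapes): the two weight matrices compose in series, while each hidden unit contributes an independent additive term in parallel, so dropping one unit removes exactly that unit’s rank-1 contribution.

```python
import numpy as np

rng = np.random.default_rng(1)

W1 = rng.normal(size=(4, 3))  # layer 1: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))  # layer 2: 4 hidden units -> 2 outputs

def forward(x, drop_unit=None):
    h = np.maximum(W1 @ x, 0.0)   # hidden units compute in parallel
    if drop_unit is not None:
        h[drop_unit] = 0.0        # dropout of one parallel unit
    return W2 @ h                 # the two layers compose in series

x = rng.normal(size=3)

full = forward(x)
partial = forward(x, drop_unit=2)
h = np.maximum(W1 @ x, 0.0)

# Dropping unit 2 removes exactly that unit's additive contribution:
print(np.allclose(full - partial, W2[:, 2] * h[2]))  # True
```

This is why dropout within a layer punishes superposition: units in parallel sum into the output, so losing one unit directly corrupts every feature that was sharing it.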
The analogy is with electrical circuits, but I was mostly thinking in terms of LLM components: the MLP and attention blocks are in series (you go through the first and then through the second), but inside one component the units are in parallel.
My guess, then, is that inside a component there is less superposition (this post is evidence for that), while between components there is redundancy (so if a computation fails somewhere, it is also done somewhere else).
In general, dropout makes me feel that, because some parts of the network will randomly fail, the network has to implement redundant, “independent” components in order to compute things reliably.
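The redundancy argument can be sketched as follows (a toy illustration under my own assumptions, not anything from the post): if two components each compute the same function and the network averages their outputs, then with inverted-dropout rescaling the function is still computed in expectation even though either component may be knocked out on any given pass.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return 3.0 * x  # the function the network needs to compute

def redundant_forward(x, p=0.5):
    # Two redundant components, each surviving with prob 1-p;
    # inverted dropout rescales survivors by 1/(1-p).
    keep = (rng.random(2) > p) / (1.0 - p)
    return 0.5 * (keep[0] * f(x) + keep[1] * f(x))

x = 1.0
outs = np.array([redundant_forward(x) for _ in range(20000)])
print(outs.mean())  # close to f(x) = 3.0: redundancy preserves the computation on average
```

A single non-redundant component would instead output 0 on every pass where it is dropped, so the network is pushed toward duplicating important computations across components in series.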