Great post! I was wondering if the conclusion to be drawn is really that "dropout inhibits superposition"? My prior was that it should increase it (so this post proved me wrong on that part), mainly because in a model with MLPs in parallel (like a transformer), dropout would force redundancy of circuits, not inside one MLP, but across different MLPs.
I'd like to see more on that; it would be super useful to know whether or not dropout helps interpretability, to decide whether to enforce it during training.
Thanks!
I haven’t yet explored the case you describe with transformers. In general I suspect the effect will be more complex for multilayer transformers than it is for simple autoencoders. However, in the simple case explored so far, it seems this conclusion can be drawn. Dropout essentially adds variability to the direction in which each feature is represented, so features can’t be packed as tightly.
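A minimal sketch of that intuition (hypothetical toy code, not the setup from the post): a linear map into a small hidden space with dropout applied to the hidden layer, showing that the hidden direction used for one feature changes with the dropout mask.

```python
# Hypothetical toy setup (not the post's code): a linear map into a small
# hidden space with dropout, to show that the hidden direction used for a
# single feature varies with the dropout mask.
import torch

torch.manual_seed(0)
n_features, n_hidden = 5, 2
W = torch.randn(n_hidden, n_features)   # assumed encoder weights
dropout = torch.nn.Dropout(p=0.5)       # dropout on the hidden layer (training mode)

x = torch.zeros(n_features)
x[0] = 1.0                              # a single active feature

for _ in range(3):
    h = dropout(W @ x)                  # hidden activations under one dropout mask
    print(h / (h.norm() + 1e-8))        # the direction this feature occupies this pass
```

Each pass prints a different (normalized) direction for the same feature, which is the variability that keeps features from being packed as tightly.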
One thing I just thought of: I would predict that dropout reduces superposition in parallel and increases superposition in series (because, to make sure the function gets computed, you can have redundancy).
What do you mean exactly by ‘in parallel’ and ‘in series’?
In an MLP, the nodes from different layers are in series (you need to go through the first and then the second), but inside the same layer they are in parallel (you go through one or the other).
The analogy is with electrical circuits, but I was mostly thinking in terms of LLM components: the MLPs and attention blocks are in series (you go through the first and then through the second), but inside one component the units are in parallel (see the sketch at the end of this comment).
I guess that then, inside a component there is less superposition (this post is evidence for that), and between components there is redundancy (so if a computation fails somewhere, it is also done somewhere else).
In general, dropout makes me feel like, because some parts of the network are not going to work, the network has to implement “independent” components in order to compute things properly.
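Here is a minimal sketch of the series/parallel picture above (hypothetical code, assuming a standard residual transformer block, not the architecture from the post): attention and the MLP are in series along the residual stream, while the hidden neurons inside the MLP act in parallel on the same input.

```python
# Hypothetical sketch of series vs. parallel in a transformer block
# (standard residual block, not the toy model from the post).
import torch
import torch.nn as nn

d_model, d_hidden, n_heads = 16, 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
mlp_in = nn.Linear(d_model, d_hidden)
mlp_out = nn.Linear(d_hidden, d_model)

def block(x):
    # In series: the MLP only sees the residual stream *after* attention wrote to it.
    a, _ = attn(x, x, x)
    x = x + a
    # In parallel: each of the d_hidden neurons computes its own relu(w_i · x + b_i)
    # independently of the others; dropping one doesn't route through another.
    h = torch.relu(mlp_in(x))
    return x + mlp_out(h)

x = torch.randn(1, 10, d_model)  # (batch, seq, d_model)
print(block(x).shape)            # torch.Size([1, 10, 16])
```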