I’m confused by the read-in bound:
Sure, each neuron reads from $\frac{Tn\log M}{M}$ of the random subspaces. But in all but $k$ of those subspaces, the big network’s activations are smaller than $\delta$, right? So I was expecting a tighter bound, something like:
$$\epsilon^{l,\mathrm{in}}_t = O\!\left(w_a\sqrt{\frac{(k+T\delta)\,md}{MD}\log M}\right)$$
EDIT: Sorry, misunderstood your question at first.
Even if $\delta=0$, all those subspaces will have some nonzero overlap $O(1/\sqrt{D})$ with the activation vectors of the $k$ active subnets. The subspaces of the different small networks in the residual stream aren’t orthogonal.
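In case a concrete check helps, here’s a quick numpy sketch of that $O(1/\sqrt{D})$ overlap. It’s my own toy example rather than the post’s construction: I just treat the relevant subspace directions as independently drawn random directions in $\mathbb{R}^D$, and the dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independently drawn random unit vectors in R^D: the typical magnitude of
# their inner product scales like 1/sqrt(D) (up to a constant).
for D in [64, 256, 1024, 4096]:
    u = rng.standard_normal((1000, D))
    v = rng.standard_normal((1000, D))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    overlaps = np.abs(np.sum(u * v, axis=1))
    print(f"D={D:5d}  mean |<u,v>| = {overlaps.mean():.4f}   1/sqrt(D) = {1/np.sqrt(D):.4f}")
```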
Ah, I think I understand. Let me write it out to double-check, and in case it helps others.
Say $\delta=0$, for simplicity. Then $A^l = \sum_t E_t a^l_t$. This sum has $k$ nonzero terms.
In your construction, $W^{l,\mathrm{in}} = \sum_t V^l_t W^{l,\mathrm{in}}_t E_t^T$. Focussing on a single neuron, labelled by $i$, we have $(W^{l,\mathrm{in}})_i = \sum_t (V^l_t)_i W^{l,\mathrm{in}}_t E_t^T$. This sum has $\sim pT$ nonzero terms.
So the preactivation of an MLP hidden neuron in the big network is $p^l_i = \sum_{t,t'} (V^l_t)_i W^{l,\mathrm{in}}_t E_t^T E_{t'} a^l_{t'}$. This sum has $\sim kpT$ nonzero terms.
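Just to spell the split out (nothing new here, only rearranging the sum above, and assuming the embeddings are roughly orthonormal so that $E_t^T E_t \approx I$):

$$p^l_i = \underbrace{\sum_{t} (V^l_t)_i\, W^{l,\mathrm{in}}_t\, E_t^T E_t\, a^l_t}_{t=t'\text{ (signal)}} + \underbrace{\sum_{t \neq t'} (V^l_t)_i\, W^{l,\mathrm{in}}_t\, E_t^T E_{t'}\, a^l_{t'}}_{t \neq t'\text{ (noise)}},$$

where each entry of a cross-overlap $E_t^T E_{t'}$ is the $O(1/\sqrt{D})$ quantity from above.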
We only “want” the terms where $t=t'$; the rest (i.e. the majority) are noise. Each noise term in the sum is a random vector, so the $\sim kpT$ different noise terms are roughly orthogonal to one another, and the norm of the noise is $O(\sqrt{kpT})$ (times some other factors, but this captures the $T$-dependence, which is what I was confused about).
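For what it’s worth, a small numerical toy version of this argument reproduces the $\sqrt{kpT}$ scaling. This is my own simplification, not the post’s exact construction: I collapse $(V^l_t)_i W^{l,\mathrm{in}}_t$ into a single random read-in vector per subnet, use Gaussian embeddings, and make up all the sizes ($d$, $D$, $k$, $p$); the point is just that the ratio in the last column stays roughly constant as $T$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

d, D = 8, 256        # small / big residual-stream dims (made up)
k, p = 4, 0.05       # active subnets, prob. a given neuron reads a given subnet
n_trials = 50

for T in [100, 200, 400, 800]:
    noise_norms = []
    for _ in range(n_trials):
        # Random embedding of each small network's residual stream into R^D.
        # Entries ~ N(0, 1/D), so E_t^T E_t ~ I while E_t^T E_t' has O(1/sqrt(D)) entries.
        E = rng.normal(0.0, 1.0 / np.sqrt(D), size=(T, D, d))

        # Effective read-in of one big-MLP neuron per subnet it reads,
        # standing in for the row (V^l_t)_i W^{l,in}_t.
        reads = rng.random(T) < p
        w = rng.standard_normal((T, d)) * reads[:, None]

        # k active subnets; delta = 0, so all the others are exactly zero.
        a = np.zeros((T, d))
        active = rng.choice(T, size=k, replace=False)
        a[active] = rng.standard_normal((k, d))

        # Big residual stream and the neuron's preactivation.
        A = np.einsum('tDd,td->D', E, a)        # A^l = sum_t E_t a^l_t
        pre = np.einsum('td,tDd,D->', w, E, A)  # p^l_i = sum_{t,t'} w_t E_t^T E_t' a_t'

        # Signal = the t = t' terms; everything else is noise.
        signal = np.einsum('td,tDd,tDe,te->', w, E, E, a)
        noise_norms.append(abs(pre - signal))

    rms = np.sqrt(np.mean(np.square(noise_norms)))
    print(f"T={T:4d}  noise rms = {rms:.3f}   rms / sqrt(k*p*T) = {rms / np.sqrt(k * p * T):.3f}")
```

(In this toy setup the leftover constant comes out around $d/\sqrt{D}$, i.e. the “times some other factors” part above.)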
Yes, that’s right.