If an information channel isn’t a subcircuit, then what is an information channel? (If you just want to drop a link to some previous post of yours, that would be helpful. Googling didn’t bring up much from you specifically.) I think this must be the sticking point in our current discussion. A “scarce useful subcircuits” claim at initialization seems false to me, basically because of (the existing evidence for) the lottery ticket hypothesis (LTH).
What I meant by “full rank” was that the Jacobian would be essentially full-rank. This turns out not to be true (see below), but I also wouldn’t say that the Jacobian has O(1) rank either. Here are the singular values of the Jacobian of the last residual stream with respect to the first residual stream vector. The first plot is for the same token (near the end of a context window of size 1024), and the second is for two tokens that are two positions apart (same context window).
These matrices have a bunch of singular values that are close to zero, but they also have a lot of singular values that are not that much lower than the maximum. It would take a fair amount of time to compute the Jacobian over a large number of tokens to really answer the question you posed.
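For anyone who wants to poke at this kind of measurement themselves, here is a minimal sketch of the computation on a toy model. To be clear about the assumptions: this is not the model or code behind the plots above. `toy_residual_stack` is a hypothetical randomly initialized causal attention + MLP stack, the sizes are tiny, and the Jacobian is estimated with central finite differences rather than autodiff. It only illustrates what "singular values of the Jacobian of the last residual stream with respect to the first residual stream vector" means for a chosen pair of token positions.

```python
import numpy as np

def toy_residual_stack(x, params):
    """Hypothetical toy model: x is the (T, d) first residual stream,
    the return value is the (T, d) last residual stream."""
    T, d = x.shape
    causal = np.tril(np.ones((T, T), dtype=bool))  # position t attends to <= t
    for Wq, Wk, Wv, W1, W2 in params:
        scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
        scores = np.where(causal, scores, -np.inf)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        x = x + attn @ (x @ Wv)        # attention writes into the residual stream
        x = x + np.tanh(x @ W1) @ W2   # MLP writes into the residual stream
    return x

def jacobian_singular_values(x, params, out_pos, in_pos, eps=1e-5):
    """Singular values of d last[out_pos] / d first[in_pos],
    estimated column by column with central differences."""
    d = x.shape[1]
    J = np.zeros((d, d))
    for j in range(d):
        dx = np.zeros_like(x)
        dx[in_pos, j] = eps
        plus = toy_residual_stack(x + dx, params)[out_pos]
        minus = toy_residual_stack(x - dx, params)[out_pos]
        J[:, j] = (plus - minus) / (2 * eps)
    return np.linalg.svd(J, compute_uv=False)  # sorted in descending order

rng = np.random.default_rng(0)
T, d, n_layers = 8, 16, 3  # toy sizes, chosen only to keep this fast
params = [
    tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(5))
    for _ in range(n_layers)
]
x = rng.normal(size=(T, d))

# Same token (last position), and two tokens that are two positions apart.
sv_same = jacobian_singular_values(x, params, out_pos=T - 1, in_pos=T - 1)
sv_apart = jacobian_singular_values(x, params, out_pos=T - 1, in_pos=T - 3)

# Normalizing by the largest singular value shows how quickly the spectrum decays.
print(sv_same / sv_same[0])
print(sv_apart / sv_apart[0])
```

On a real model you would swap the finite differences for an autodiff Jacobian and repeat this over many token pairs, which is where the compute cost mentioned above comes from.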