I calculated mutual information using this formula: https://stats.stackexchange.com/a/438613, between Gaussian approximations to a randomly initialized GPT2-small-sized model and GPT2 itself, at all levels of the residual stream. Here are the results:
These numbers seem rather high to me? I’m not sure how valid this is, and it’s kind of surprising to me at first glance. I’ll try to post a clean colab in like an hour or so.
I can’t tell whether it’s a real thing or whether it’s just approximation error in the empirical covariance. The more points you estimate with, the lower the mutual information goes, but it seems to be asymptoting above zero AFAICT:
https://colab.research.google.com/drive/1cQNXFTQVV_Xc2-PCQn7OnEdz0mpcMAlz?usp=sharing
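Concretely, the estimator is just the Gaussian mutual information formula from that link applied to empirical covariances; a minimal sketch (with `x` and `y` standing in for samples × dimensions arrays of residual-stream activations from the initialized and trained models at matching layers):

```python
import numpy as np

def gaussian_mi(x, y):
    """I(X; Y) under a joint-Gaussian approximation, in nats:
    0.5 * (logdet Cov[X] + logdet Cov[Y] - logdet Cov[X, Y])."""
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    cov_joint = np.cov(np.concatenate([x, y], axis=1), rowvar=False)
    # slogdet is more numerically stable than log(det(...)) for 768-dim covariances.
    _, ld_x = np.linalg.slogdet(cov_x)
    _, ld_y = np.linalg.slogdet(cov_y)
    _, ld_joint = np.linalg.slogdet(cov_joint)
    # Needs many more samples than dimensions, or cov_joint is near-singular and
    # the estimate blows up (the empirical-covariance worry above).
    return 0.5 * (ld_x + ld_y - ld_joint)

# Toy check with correlated random data (stand-in for real activations):
rng = np.random.default_rng(0)
x = rng.normal(size=(50_000, 20))
y = x + rng.normal(size=(50_000, 20))   # strongly dependent on x
print(gaussian_mi(x, y))                # should land near 20 * 0.5 * ln(2) ≈ 6.9 nats
```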
The big question is what the distribution of eigenvalues (or equivalently singular values) of that covariance matrix looks like. If it’s dominated by one or three big values, then what we’re seeing is basically one or three main information-channels which touch basically-all the nodes, but then the nodes are roughly independent conditional on those few channels. If the distribution drops off slowly (and the matrix can’t be permuted to something roughly block diagonal), then we’re in scarce modules world.
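Concretely, the check I have in mind is just the eigenvalue spectrum of the empirical covariance; a quick sketch, where `acts` is a stand-in for a samples × dimensions array of residual-stream activations at one layer:

```python
import numpy as np

# acts: (n_samples, d_model) residual-stream activations at one layer.
# Random placeholder here; substitute activations gathered from the model.
acts = np.random.default_rng(0).normal(size=(5000, 768))

eigvals = np.linalg.eigvalsh(np.cov(acts, rowvar=False))[::-1]   # sorted descending
frac = np.cumsum(eigvals) / eigvals.sum()
print("variance explained by top 3 eigenvalues:", frac[2])
print("eigenvalues needed for 90% of variance:", np.searchsorted(frac, 0.9) + 1)
# A few dominant eigenvalues => a few main information channels touching everything;
# a slow drop-off => the scarce-modules picture.
```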
Also, did you say you’re taking correlations between the initialized net and the trained net? Is the idea there to use the trained net as a proxy for abstractions in the environment?
Yes, I was using GPT2-small as a proxy for knowledge of the environment.
The covariance matrix of the residual stream has the structure you suggest (a few large eigenvalues), but I don’t really see why that’s evidence for sparse channels? In my mind, there is a sharp distinction between what I’m saying (every possible communication channel is open, but they tend to point in similar average directions, and thus the eigenvalues of the residual stream covariance are unbalanced) and what I understand you to be saying (there are few channels).
In a transformer at initialization, the attention pattern is very close to uniform. So, to a first approximation, each attention operation is W_O * W_V (both of which one can check have slowly declining singular values at initialization) applied to the average of the residual stream values at every previous token. The MLPs are initialized to do smallish and pretty random things to the information AFAICT, and anyway they are limited to the current token.
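As a quick numerical sanity check on the near-uniform-attention point, one can draw GPT-2-scale random projections and look at how far a single head’s softmax pattern is from uniform (the 0.02 weight scale and roughly unit-variance residual entries are assumptions here, and the causal mask is ignored):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 768, 64, 128

# GPT-2-style init: projection weights ~ N(0, 0.02^2); post-LayerNorm residual
# stream entries are roughly unit variance.
W_Q = rng.normal(0.0, 0.02, (d_model, d_head))
W_K = rng.normal(0.0, 0.02, (d_model, d_head))
resid = rng.normal(0.0, 1.0, (n_tokens, d_model))

scores = (resid @ W_Q) @ (resid @ W_K).T / np.sqrt(d_head)
scores -= scores.max(axis=-1, keepdims=True)        # softmax, numerically stable
pattern = np.exp(scores)
pattern /= pattern.sum(axis=-1, keepdims=True)

print("uniform weight would be", 1 / n_tokens)
print("actual weights range from", pattern.min(), "to", pattern.max())
```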
Given this picture, intervening at any node of the computation graph (say, offsetting it by a vector δ) will always cause a small but full-rank update at every node downstream of it (i.e., every residual stream vector at every token that isn’t screened off by causal masking). This seems to me like the furthest one can possibly go in the sparse-modules direction along this particular axis? Like, anything you can say about there being sparse channels seems more true of the trained transformer than of the initialized transformer.
Backing out of the details of transformers, my understanding is that people still mostly believe in the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635) for most neural network architectures. The Lottery Ticket Hypothesis seems diametrically opposed to the claim you are making; in the LTH, the network is initialized with a set of channels that is very close to being a correct model of the environment (a “lottery ticket”), and learning consists primarily of getting rid of everything else, with some small amount of cleaning up the weights of the lottery ticket. (Like a sculptor chipping away everything but the figure.) Do you have any thoughts on how your claim interacts with the LTH, or on the LTH itself?
How would the rest of your argument change if we reliably start out in a scarce modules setting? I imagine there are still plenty of interesting consequences.
Yes, I was using GPT2-small as a proxy for knowledge of the environment.
Clever.
Given this picture, intervening at any node of the computation graph (say, offsetting it by a vector δ) will always cause a small but full-rank update at every node downstream of it (i.e., every residual stream vector at every token that isn’t screened off by causal masking). This seems to me like the furthest one can possibly go in the sparse-modules direction along this particular axis?
Not quite. First, the update at downstream nodes induced by a delta in one upstream node can’t be “full rank”, because it isn’t a matrix; it’s a vector. The corresponding matrix of interest would be the change in each downstream node induced by a small change in each upstream node (with rows indexed by upstream nodes and cols indexed by downstream nodes). The question would be whether small changes to many different upstream nodes induce approximately the same (i.e. approximately proportional, or more generally approximately within one low-dimensional space) changes in a bunch of downstream nodes. That question would be answered by looking at the approximate rank (i.e. singular values) of the matrix.
Whether that gives the same answer as looking at the eigenvalues of a covariance matrix depends mostly on how “smooth” things are within the net. I’d expect it to be very smooth in practice (in the relevant sense), because otherwise gradients explode, so the answer should be pretty similar whether looking at the covariance matrix or the matrix of derivatives.
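Here’s a sketch of that matrix of derivatives for a generic map `f` from upstream node values to downstream node values, via finite differences (`f`, `A`, and `x0` below are illustrative stand-ins, nothing transformer-specific):

```python
import numpy as np

def response_matrix(f, x0, eps=1e-4):
    """Entry (i, j): change in downstream coordinate j per unit change in
    upstream coordinate i (rows indexed by upstream, cols by downstream)."""
    y0 = f(x0)
    rows = []
    for i in range(len(x0)):
        x = x0.copy()
        x[i] += eps
        rows.append((f(x) - y0) / eps)
    return np.stack(rows)

# Toy usage: a random smooth map standing in for "upstream -> downstream".
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 10))
M = response_matrix(lambda x: np.tanh(A @ x), x0=rng.normal(size=10))
svals = np.linalg.svd(M, compute_uv=False)
print(svals / svals.max())   # sharp drop-off => few channels; slow decay => many
```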
The covariance matrix of the residual stream has the structure you suggest (a few large eigenvalues), but I don’t really see why that’s evidence for sparse channels? In my mind, there is a sharp distinction between what I’m saying (every possible communication channel is open, but they tend to point in similar average directions, and thus the eigenvalues of the residual stream covariance are unbalanced) and what I understand you to be saying (there are few channels).
Roughly speaking, if there’s lots of different channels, then the information carried over those channels should be roughly independent/uncorrelated (or at least only weakly correlated). After all, if the info carried by the channels is highly correlated, then it’s (approximately) the same information, so we really only have one channel. If we find one big super-correlated component over everything, and everything else is basically independent after controlling for that one (i.e. the rest of the eigenvalues are completely uniform), that means there’s one big channel.
What a sparse-modules world looks like is that there’s lots of eigenvectors with medium-sized eigenvalues: intuitively, lots of different “dimensions” of information which are each spread over medium amounts of the system, and any given medium-sized chunk of the system touches a whole bunch of those channels.
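A toy version of this contrast, with synthetic data rather than transformer activations, just to show what the two spectra look like:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d = 10_000, 100

# One big channel: every coordinate = scaled shared signal + independent noise.
shared = rng.normal(size=(n_samples, 1))
one_channel = 3 * shared @ rng.normal(size=(1, d)) + rng.normal(size=(n_samples, d))

# Many medium channels: ~30 independent signals, each spread over many coordinates.
signals = rng.normal(size=(n_samples, 30))
many_channels = signals @ rng.normal(size=(30, d)) + rng.normal(size=(n_samples, d))

for name, data in [("one big channel", one_channel),
                   ("many medium channels", many_channels)]:
    eigs = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]
    print(name, "-- top 5 eigenvalues (relative):", np.round(eigs[:5] / eigs[0], 2))
```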
Backing out of the details of transformers, my understanding is that people still mostly believe in the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635) for most neural network architectures. The Lottery Ticket Hypothesis seems diametrically opposed to the claim you are making; in the LTH, the network is initialized with a set of channels that is very close to being a correct model of the environment (a “lottery ticket”), and learning consists primarily of getting rid of everything else, with some small amount of cleaning up the weights of the lottery ticket. (Like a sculptor chipping away everything but the figure.) Do you have any thoughts on how your claim interacts with the LTH, or on the LTH itself?
Insofar as the LTH works at all, I definitely do not think it works when thinking of the “lottery tickets” as information channels. Usually people say the “tickets” are subcircuits, which are very different beasts from information channels.
How would the rest of your argument change if we reliably start out in a scarce modules setting? I imagine there are still plenty of interesting consequences.
In a sparse modules setting, it’s costly to get a reusable part—something which “does the same thing” across many different settings. So in order to get a component which does the same thing in many settings, we would need many bits of selection pressure from the environment, in the form of something similar happening repeatedly across different environmental contexts.
If an information channel isn’t a subcircuit, then what is an information channel? (If you just want to drop a link to some previous post of yours, that would be helpful. Googling didn’t bring up much from you specifically.) I think this must be the sticking point in our current discussion. A “scarce useful subcircuits” claim at initialization seems false to me, basically because of (the existing evidence for) the LTH.
What I meant by “full rank” was that the Jacobian would be essentially full-rank. This turns out not to be true (see below), but I also wouldn’t say that the Jacobian has O(1) rank either. Here are the singular values of the Jacobian of the last residual stream with respect to the first residual stream vector. The first plot is for the same token (near the end of a context window of size 1024), and the second is for two tokens that are two apart (same context window).
These matrices have a bunch of singular values that are close to zero, but they also have a lot of singular values that are not that much lower than the maximum. It would take a fair amount of time to compute the Jacobian over a large number of tokens to really answer the question you posed.
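For reference, here’s a sketch of the kind of computation behind those plots, using Hugging Face transformers (a short context, arbitrary token positions, and random stand-in embeddings to keep it cheap; the actual plots used a 1024-token context):

```python
import torch
from transformers import GPT2Config, GPT2Model

# Randomly initialized GPT2-small-sized model; swap in
# GPT2Model.from_pretrained("gpt2") to look at the trained network instead.
model = GPT2Model(GPT2Config()).eval()

seq_len = 32
t_in, t_out = 10, 12                                           # upstream / downstream token positions
embeds = torch.randn(1, seq_len, model.config.n_embd) * 0.02   # stand-in for token embeddings

def last_resid_at_t_out(v):
    """Last residual stream vector at t_out as a function of the first
    residual stream vector at t_in (everything else held fixed). Note that
    HF's last hidden state includes the final LayerNorm."""
    e = embeds.clone()
    e[0, t_in] = v
    out = model(inputs_embeds=e, output_hidden_states=True)
    return out.hidden_states[-1][0, t_out]

J = torch.autograd.functional.jacobian(last_resid_at_t_out, embeds[0, t_in])
svals = torch.linalg.svdvals(J)          # 768 singular values
print(svals / svals.max())
```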