Second, Lomborg’s 2024 paper finds evidence that women time their births to just before a wage-growth plateau. The evidence again comes from IVF failures: women who were planning a birth but never succeeded show much flatter wage growth after their planned birth year, even though they never actually had kids. So the divergence between childbearing and non-childbearing women shows up even in this placebo case, where neither group had children. Therefore, the event study overstates the earnings impact of childbirth.
Couldn’t it also be that these women plan their careers around the expectation of having children, and that this expectation is what produces the plateau? In that case it would be incorrect to interpret these results as evidence against a child penalty; rather, the child penalty would affect women regardless of whether they actually have children. To check, I think you should ask the study participants why their careers plateaued.
Finally gonna start properly experimenting on stuff. Just writing up what I’m doing to force myself to do something, not claiming this is necessarily particularly important.
Llama (and many other models, but I’m doing experiments on Llama) has a piece of code that looks like this:
h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)  # attention "write"
out = h + self.feed_forward(self.ffn_norm(h))  # feed-forward "write"
Here, out is the result of the transformer layer (aka the residual stream), and the vectors self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and self.feed_forward(self.ffn_norm(h)) are basically where all the computation happens. So basically the transformer proceeds as a series of “writes” to the residual stream using these two vectors.
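The "series of writes" view can be sketched abstractly like this (the function name and the stand-in callables are mine, not Llama's):

```python
# Minimal sketch of the residual-stream view of one transformer layer.
# `attention`, `feed_forward`, and the two norms are stand-ins for the
# real modules; any callables with matching shapes work.
def layer_forward(x, attention, feed_forward, attention_norm, ffn_norm):
    attn_write = attention(attention_norm(x))  # first "write" vector
    h = x + attn_write                         # residual stream after attention
    mlp_write = feed_forward(ffn_norm(h))      # second "write" vector
    out = h + mlp_write                        # residual stream after the layer
    return out, attn_write, mlp_write
```

Collecting attn_write and mlp_write for every layer and token is what yields the set of "writes" analyzed below.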
I took all the residual-stream writes for some queries to Llama-8b and stacked them into a big matrix M with 4096 columns (the internal hidden dimensionality of the model). Then, using SVD, I can express M = ∑ᵢ sᵢ (uᵢ ⊗ vᵢ), where the u’s and v’s are orthonormal unit vectors. This decomposes the “writes” into which writes contribute to each component (the u’s), the latent directions in the residual stream that are written to (the v’s), and the strengths of those writes (the s’s, aka the singular values).
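A sketch of this decomposition with NumPy, where random vectors stand in for the real collected writes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_writes, d_model = 64, 4096
M = np.stack([rng.standard_normal(d_model) for _ in range(n_writes)])  # one write per row

# Thin SVD: M = U @ diag(s) @ Vt, with s sorted in descending order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Equivalent rank-one form: M = sum_i s[i] * outer(U[:, i], Vt[i])
M_rebuilt = (U * s) @ Vt
assert np.allclose(M, M_rebuilt)
```

The rows of Vt are the unit directions in the 4096-dimensional residual stream, and s gives the strength of each component.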
To get a feel for the complexity of the writes, I then plotted the s’s in descending order. For the prompt “I believe the meaning of life is”, Llama generated the continuation “to be happy. It is a simple concept, but it is very difficult to achieve. The only way to achieve it is to follow your heart. If you follow your heart, you will find happiness. If you don’t follow your heart, you will never find happiness. I believe that the meaning of life is to”. During this continuation, there were 2272 writes to the residual stream, and the singular values for these writes were as follows:
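The summaries behind the three plots can be computed directly; here is a sketch with a stand-in spectrum (the variable names and the 90% cutoff are my choices, not from the original analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.sort(rng.standard_normal(2272) ** 2)[::-1]  # stand-in for the 2272 singular values

top2_share = s[:2].sum() / s.sum()           # how much the two largest directions carry
cumulative = np.cumsum(s) / s.sum()          # cumulative fraction of total singular value mass
# Number of components needed to cover 90% of the mass:
effective_rank = int(np.searchsorted(cumulative, 0.9)) + 1
```

A near-linear cumulative curve corresponds to an effective rank close to the full count of writes.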
The first diagram shows that two singular values were much larger than all the others. The second shows that most of the singular values are non-negligible, which indicates to me that almost all of the writes carry nontrivial information. This can also be seen in the last diagram, where the cumulative sum of the singular values increases approximately linearly with their count.
This is kind of unfortunate, because if almost all of the singular value mass were concentrated in a relatively small number of dimensions (e.g. 100), then we could simplify the network a lot by projecting down to those dimensions. Still, this was relatively expected, since others have found the singular value spectra of neural networks to be very complex.
Since variance explained is likely nonlinearly related to output quality, my next step will likely be to truncate the writes to the first k singular vectors and see how that impacts the performance of the network.
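The planned truncation amounts to a rank-k approximation of the write matrix; a minimal sketch (the function name is hypothetical, and a real run would apply the projection during the forward pass rather than offline):

```python
import numpy as np

def truncate_writes(M, k):
    """Best rank-k approximation of M (Eckart-Young), keeping only the
    top-k singular directions of the stacked writes."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
M = rng.standard_normal((64, 256))   # stand-in for the stacked writes
M_k = truncate_writes(M, k=16)
```

Sweeping k and measuring perplexity (or the model's continuation quality) would then show how nonlinearly quality depends on the retained singular value mass.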