Interestingly, I found a very high correlation between gender bias and racial bias in the RLHF model (first graph below on the left). This result is especially pronounced when contrasted with the cosine similarities of the corresponding bias vectors in the base model.
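(To make the measurement concrete, here's a minimal sketch of the per-layer cosine similarity being plotted, assuming the bias directions have already been extracted as one vector per layer; the helper name and array shapes are my own, not from the original analysis.)

```python
import numpy as np

def layerwise_cosine(dirs_a: np.ndarray, dirs_b: np.ndarray) -> np.ndarray:
    """Cosine similarity between two sets of per-layer bias directions.

    dirs_a, dirs_b: arrays of shape (n_layers, d_model), one direction per layer.
    Hypothetical helper; the actual bias vectors come from whatever probing
    setup produced the graphs below.
    """
    num = np.sum(dirs_a * dirs_b, axis=1)
    den = np.linalg.norm(dirs_a, axis=1) * np.linalg.norm(dirs_b, axis=1)
    return num / den

# e.g. layerwise_cosine(gender_bias_dirs, racial_bias_dirs) yields one similarity
# per layer, which is the quantity compared between the base and RLHF models.
```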
On a brief search, it looks like Llama2 7B has an internal embedding dimension of 4096 (certainly it's in the thousands). In a space of that high a dimensionality, a cosine similarity of even 0.5 indicates extremely similar vectors: O(99.9%) of random pairs of uncorrelated vectors will have cosines of less than 0.5, and on average the cosine of two random vectors will be very close to zero. So at all but the latest layers (where the model is actually putting concepts back into words), all three of these bias directions point in very similar directions, in both the base and RLHF models; this is even more pronounced at early layers in the base model and at all layers in the RLHF model.
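(A quick numerical sketch of how concentrated around zero random cosines are in 4096 dimensions; the standard deviation is roughly 1/sqrt(d) ≈ 0.016, so a cosine of 0.5 sits dozens of standard deviations out. This is just an illustration of the geometry, not the actual bias measurement.)

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096           # Llama2 7B hidden dimension
n_pairs = 100_000  # random vector pairs to sample

# Random Gaussian vectors have directions uniform on the sphere.
a = rng.standard_normal((n_pairs, d))
b = rng.standard_normal((n_pairs, d))
cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

print(f"mean cosine:               {cos.mean():.4f}")   # ~0.0
print(f"std of cosine:             {cos.std():.4f}")    # ~1/sqrt(4096) ≈ 0.0156
print(f"fraction with |cos| > 0.5: {(np.abs(cos) > 0.5).mean():.6f}")  # effectively 0
```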
In the base model this makes sense sociologically: the locations and documents on the Internet where you find any one of these biases will also tend to be significantly positively correlated with the other two; they tend to co-occur.