The Geometry of Feelings and Nonsense in Large Language Models
This post presents some ablation results around the thesis of the ICML 2024 Mechanistic Interpretability workshop's 1st-prize-winning paper, *The Geometry of Categorical and Hierarchical Concepts in Large Language Models*. The main takeaway is that the orthogonality the authors observe for categorical and hierarchical concepts shows up practically everywhere, even in places where it really should not.
Overview of the original paper
A lot of the intuition and math behind their approach is shared in their previous work, The Linear Representation Hypothesis and the Geometry of Large Language Models, but let's quickly go over the paper's core idea:
They split the computation of a large language model (LLM) as:
$$P(y \mid x) = \frac{\exp\left(\lambda(x)^\top \gamma(y)\right)}{\sum_{y' \in \text{Vocab}} \exp\left(\lambda(x)^\top \gamma(y')\right)}$$
where:
- λ(x) is the context embedding for input x (last token’s residual after last layer),
- γ(y) is the unembedding vector for output y.
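In code, this decomposition is just a softmax over the inner products between the context embedding and each unembedding vector. A minimal numpy sketch (the shapes and function names here are our own, not from the paper's codebase):

```python
import numpy as np

def next_token_probs(lam: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """P(y | x) as a softmax over inner products.

    lam:   (d,)   context embedding lambda(x), i.e. the last token's
                  residual stream after the final layer.
    gamma: (V, d) unembedding matrix, one row gamma(y) per vocab entry.
    """
    logits = gamma @ lam            # (V,) inner products lambda(x)^T gamma(y)
    logits -= logits.max()          # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()          # normalize over the vocabulary

# Toy usage with random embeddings
rng = np.random.default_rng(0)
lam = rng.normal(size=16)
gamma = rng.normal(size=(100, 16))
p = next_token_probs(lam, gamma)
```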
Next, to align the embedding and unembedding spaces and make the Euclidean inner product a causal one (see the paper for details), a transformation using the covariance matrix of the unembedding vectors is applied:
$$g(y) = \mathrm{Cov}(\gamma)^{-1/2} \left( \gamma(y) - \mathbb{E}[\gamma] \right)$$
where $\gamma(y)$ is the unembedding vector for $y$, $\mathbb{E}[\gamma]$ is the mean unembedding vector, and $\mathrm{Cov}(\gamma)$ is the covariance matrix of the unembedding vectors.
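Concretely, this is a whitening of the unembedding matrix. A hedged numpy sketch (the names are ours, and the paper's codebase may differ in details such as how the inverse square root is computed):

```python
import numpy as np

def causal_transform(gamma: np.ndarray) -> np.ndarray:
    """Whitened unembeddings: g(y) = Cov(gamma)^{-1/2} (gamma(y) - E[gamma]).

    gamma: (V, d) matrix of unembedding vectors, one row per vocab entry.
    Returns the (V, d) matrix of transformed vectors g(y).
    """
    centered = gamma - gamma.mean(axis=0)   # gamma(y) - E[gamma]
    cov = np.cov(gamma, rowvar=False)       # (d, d) sample covariance
    # Inverse matrix square root via eigendecomposition (cov is symmetric PSD).
    vals, vecs = np.linalg.eigh(cov)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return centered @ inv_sqrt

# After the transform, the sample covariance of the vectors is the identity.
rng = np.random.default_rng(0)
gamma = rng.normal(size=(500, 8))
g = causal_transform(gamma)
```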
Now, for any concept W, its vector representation ℓW is defined to be one that follows these two constraints:
$$P(W = 1 \mid \ell + \alpha \ell_W) > P(W = 1 \mid \ell) \quad \forall \alpha > 0$$
$$P(Z \mid \ell + \alpha \ell_W) = P(Z \mid \ell) \quad \text{for any off-target concept } Z$$
Given such vector representations $\ell_W$ for binary concepts (where $\ell_{w_1} - \ell_{w_0}$ is the linear representation of $w_0 \Rightarrow w_1$), the following orthogonality relations hold:
$$\ell_w \perp (\ell_z - \ell_w) \quad \text{for } z \prec w$$
I’m skipping some notation here, but this says that for hierarchical concepts mammal $\prec$ animal, we have $\ell_{\text{animal}} \perp (\ell_{\text{mammal}} - \ell_{\text{animal}})$.
$$\ell_w \perp (\ell_{z_1} - \ell_{z_0}) \quad \text{for } \{z_0, z_1\} \prec w$$
Similarly, this means $\ell_{\text{animal}} \perp (\ell_{\text{mammal}} - \ell_{\text{reptile}})$.
$$(\ell_{w_1} - \ell_{w_0}) \perp (\ell_{z_1} - \ell_{z_0}) \quad \text{for } \{z_0, z_1\} \prec \{w_0, w_1\}$$
$$(\ell_{w_1} - \ell_{w_0}) \perp (\ell_{w_2} - \ell_{w_1}) \quad \text{for } w_2 \prec w_1 \prec w_0$$
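Given estimated concept vectors, the first two relations are easy to check numerically as cosine similarities. A small helper (the function names are ours; in the paper's setting the concept vectors come from their estimator, whereas here we only demo it on hand-built vectors that satisfy the relations exactly):

```python
import numpy as np

def cos(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hierarchy_cosines(l_parent, l_child_a, l_child_b):
    """Cosines that the first two orthogonality relations predict to be ~0:
    parent vs (child - parent), and parent vs (child_a - child_b)."""
    return {
        "parent vs (child - parent)": cos(l_parent, l_child_a - l_parent),
        "parent vs (child_a - child_b)": cos(l_parent, l_child_a - l_child_b),
    }

# Hand-built example satisfying the relations: each child extends the
# parent along a direction orthogonal to it.
l_animal = np.array([1.0, 0.0, 0.0])
l_mammal = l_animal + np.array([0.0, 1.0, 0.0])
l_reptile = l_animal + np.array([0.0, 0.0, 1.0])
cosines = hierarchy_cosines(l_animal, l_mammal, l_reptile)
```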
They also show that categorical concepts form simplices in the transformed representation space. For each of these theorems, they give concrete proofs and provide experimental evidence on GPT-4-generated data and Gemma representations for animal and plant categories. They start with a dataset that looks like:
Ablations
To study concepts that do not form such semantic categories and hierarchies, we add the following two datasets and play around with their codebase:
First, an “emotions” dictionary splitting various sub-emotions across several top-level emotions. Note that these categories are not expected to be orthogonal (for instance, joy and sadness should be anti-correlated). We create this via a simple call to ChatGPT.
Next, we add a “nonsense” dataset with five completely random categories, where each category is defined by on the order of 100 totally random words, completely unrelated to the top-level categories. This will help us get directions for random, nonsensical concepts (again, via a ChatGPT call):
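For each category, real or nonsense, we need a direction in the transformed space. The paper has its own estimator; as a hedged illustration of the general idea, a crude stand-in is the normalized mean of the whitened unembedding vectors of the category's words:

```python
import numpy as np

def concept_direction(g: np.ndarray, word_ids: list[int]) -> np.ndarray:
    """Crude concept direction: normalized mean of the whitened
    unembedding vectors g(y) for the words in a category.
    (A simplified stand-in, NOT the paper's actual estimator.)
    """
    d = g[word_ids].mean(axis=0)
    return d / np.linalg.norm(d)

# Toy usage: a "category" of five word ids in a random stand-in
# for the whitened unembedding space.
rng = np.random.default_rng(0)
g = rng.normal(size=(1000, 32))
direction = concept_direction(g, [3, 14, 159, 265, 358])
```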
Hierarchical features are orthogonal—but so are semantic opposites!?
Now, let’s look at their main experimental results (for animals):
And this is what we get for emotions:
While this seems okay, consider the following plot, where we project just joy and sadness onto the span of sadness and all emotions:
Sadness and joy are semantic opposites, so one should expect their vectors to be anti-correlated rather than orthogonal. Also, here’s the same plot for completely random, nonsensical concepts:
It seems like their orthogonality results, while true for hierarchical concepts, are also true for semantically opposite concepts and totally random concepts. 🤔
Categorical features form simplices—but so do totally random ones!?
Here is the simplex they find animal categories to form:
And this is what we get for completely random concepts:
Thus, while categorical concepts form simplices, so do completely random, nonsensical concepts.
Orthogonality being ubiquitous in high dimensions
This is because a model’s representation space is quite high-dimensional (2048 in the case of Gemma), and $k$ independent random vectors in a high-dimensional space are expected to be almost orthogonal. Multiple concrete proofs of this are given here, but here’s a quick intuitive sketch:
Let $v_1, v_2, \ldots, v_k$ be $k$ random vectors drawn independently and uniformly from the unit sphere in $\mathbb{R}^n$. The inner product between two vectors $v_i$ and $v_j$ is given by:
$$\langle v_i, v_j \rangle = \sum_{l=1}^{n} v_{i,l} \, v_{j,l}$$
and, since $\mathbb{E}[v_i v_i^\top] = \frac{1}{n} I$ for a vector uniform on the unit sphere, this inner product has mean zero and variance:

$$\mathrm{Var}\left[\langle v_i, v_j \rangle\right] = \frac{1}{n}$$
Thus, as n (the number of dimensions) increases, the variance tends to zero:
$$\lim_{n \to \infty} \mathrm{Var}\left[\langle v_i, v_j \rangle\right] = 0$$
This implies that, in high dimensions, random vectors are almost orthogonal with high probability:
$$\langle v_i, v_j \rangle \approx 0 \quad \text{for } i \neq j$$
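This is easy to verify empirically. The sketch below samples random unit vectors at several dimensionalities (up to Gemma's 2048) and checks that the variance of the pairwise inner products shrinks like $1/n$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit_vectors(k: int, n: int) -> np.ndarray:
    """k vectors uniform on the unit sphere in R^n (normalized Gaussians)."""
    v = rng.normal(size=(k, n))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def pairwise_inner_products(v: np.ndarray) -> np.ndarray:
    """All off-diagonal inner products <v_i, v_j>, i != j."""
    gram = v @ v.T
    return gram[~np.eye(len(v), dtype=bool)]

variances = {}
for n in (8, 128, 2048):
    ips = pairwise_inner_products(sample_unit_vectors(50, n))
    variances[n] = ips.var()
    print(f"n={n:5d}  empirical var={ips.var():.5f}  1/n={1 / n:.5f}")
```

The empirical variance at $n = 2048$ is tiny, i.e. 50 random directions in Gemma's representation space are all very nearly orthogonal to one another.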
Discussion and Future Work
A transformation under which opposite concepts appear orthogonal doesn’t seem well suited for studying models: it breaks our semantic model of associating directions with concepts, and it makes steering in both directions along a single axis impossible.
As for categorical features forming simplices in the representation space, this observation isn’t surprising, because seemingly everything forms a simplex in this space, even totally random concepts.
There are lots of ways of assigning a direction or a vector to a given concept or datapoint, and it is unclear if the vectors thus obtained are correct or uniquely identifiable.
Here are some of the questions this leaves us with and ones that we’d be very excited to work on in the near future (contact us to collaborate!):
- A framework for thinking about representations that unifies how they’re obtained (contrastive activations, PCA, SAEs, etc.), how they’re used by the model, and how they can be used for control (e.g., via steering vectors).
- How to figure out how well a given object (a direction, a vector, or even a black-box function over model parameters) represents a given human-interpretable concept or feature.
- If orthogonality and simplices are too universal and not specific enough to study the geometry of categorical and hierarchical concepts, what is a good lens or theory for doing so?