Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

This research was performed by Luke Marks, Amirali Abdullah, nothoughtsheadempty and Rauno Arike. Special thanks to Fazl Barez from Apart Research for overseeing the project and contributing greatly to direction and oversight throughout. We'd also like to thank Logan Riggs for feedback and suggestions regarding autoencoder architecture and experiment design.

Introduction

Sparse Autoencoders Find Highly Interpretable Directions in Language Models showed that sparse coding achieves SOTA performance in making features interpretable using OpenAI's method of automated interpretability. We briefly tried to extend these results to reward models learned during RLHF in Pythia-70m/410m. Our method can be summarized as follows:

1. Identify layers L in MRLHF, a language model fine-tuned through RLHF, that are likely involved in reward modeling. We do so by sorting layers in order of decreasing parameter divergence from the base model under the Euclidean norm (a sketch of this ranking follows the list). To simplify notation, the succeeding steps describe our feature extraction for a single fixed layer ℓ of L.
2. For both MRLHF and a base model MBASE, train two autoencoders AE1 and AE2 of differing hidden sizes with the same sparsity constraint. These autoencoders reconstruct activation vectors at ℓ for their respective model. For each model, we extract a pair of lower-dimensional feature dictionaries D1 and D2 from the corresponding autoencoders, where each feature is a column of the decoder's weight matrix.
3. Because autoencoders produce varying dictionaries across training runs and hyperparameters, we keep only the features that occur in both D1 and D2. We compute the Max Cosine Similarity (MCS) between features in D1 and D2 to identify features that repeat across the two dictionaries, on the grounds that shared features are more likely to be genuine features of the model. The Mean Max Cosine Similarity (MMCS)[1] serves as an aggregate measure of the quality of our extracted features (see the dictionary-comparison sketch after this list).
4. The top-k most similar features between D1 and D2 in terms of MCS are explained using GPT-4, following the method detailed here and originally here. We feed the encoder of AEn activations from the model on which it was trained, and have GPT-4 predict a description of a feature from the feature weights in the encoder output. GPT-4 then simulates weights for that feature as if the predicted description were true. The Pearson correlation coefficient between the simulated and actual weights serves as a score for the accuracy of the description (see the scoring sketch after this list).
5. By explicitly comparing these explanations between MRLHF and MBASE, we investigate a case study related to reward modeling, showing how these descriptions can be correlated with reward modeling efficacy.
6. We apply this method to a training regime in which MRLHF is fine-tuned with RLHF via proximal policy optimization to learn an explicit table of words and maximize their presence in its generations. This training environment allows us to quantitatively assess the efficacy of MRLHF's reward model.
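To make step 1 concrete, here is a minimal sketch of ranking layers by parameter divergence between the base and fine-tuned checkpoints under the Euclidean norm. The fine-tuned checkpoint path and the grouping of parameters by the "layers.N." prefix are illustrative assumptions, not our exact script.

```python
# Minimal sketch (not our exact script): rank layers of the fine-tuned model by
# the Euclidean norm of their parameter difference from the base model.
from collections import defaultdict
from transformers import GPTNeoXForCausalLM

base = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m")
tuned = GPTNeoXForCausalLM.from_pretrained("path/to/pythia-70m-rlhf")  # hypothetical checkpoint

def rank_layers_by_divergence(base_model, tuned_model):
    tuned_params = dict(tuned_model.named_parameters())
    sq_dist = defaultdict(float)
    for name, p_base in base_model.named_parameters():
        if "layers." not in name:
            continue  # skip embeddings, final layer norm, and unembedding
        layer_idx = int(name.split("layers.")[1].split(".")[0])
        diff = tuned_params[name].detach() - p_base.detach()
        sq_dist[layer_idx] += diff.pow(2).sum().item()
    # Sort layers by decreasing Euclidean (L2) divergence.
    return sorted(((i, d ** 0.5) for i, d in sq_dist.items()),
                  key=lambda pair: pair[1], reverse=True)

print(rank_layers_by_divergence(base, tuned))
```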
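Steps 2 and 3 reduce to training sparse autoencoders on a layer's activations and comparing the dictionaries read off their decoders. The sketch below shows a generic sparse autoencoder and the MCS/MMCS computation from footnote [1]; the architecture and L1 sparsity penalty are assumptions, since the post does not pin down those details.

```python
# Sketch of steps 2-3: a generic sparse autoencoder (architectural details are
# assumptions) and the MCS/MMCS comparison between two learned dictionaries.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Generic sparse autoencoder over a layer's activation vectors."""
    def __init__(self, activation_dim: int, dict_size: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim, bias=False)
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        codes = F.relu(self.encoder(x))       # sparse feature coefficients
        recon = self.decoder(codes)
        loss = F.mse_loss(recon, x) + self.l1_coeff * codes.abs().mean()
        return recon, codes, loss

def dictionary(ae: SparseAutoencoder) -> torch.Tensor:
    # Each feature is a column of the decoder's weight matrix; return the
    # features as unit-norm rows of shape (dict_size, activation_dim).
    return F.normalize(ae.decoder.weight, dim=0).T

def mcs_and_mmcs(D1: torch.Tensor, D2: torch.Tensor):
    # Max cosine similarity of each feature in D1 against all of D2, and the
    # mean over D1 (the MMCS of footnote [1]).
    sims = F.normalize(D1, dim=1) @ F.normalize(D2, dim=1).T
    mcs = sims.max(dim=1).values
    return mcs, mcs.mean()

# Usage with two trained autoencoders of differing hidden sizes:
# mcs, mmcs = mcs_and_mmcs(dictionary(ae1), dictionary(ae2))
```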
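The grading in step 4 is just a Pearson correlation between the weights GPT-4 simulates under its predicted description and the actual weights from the encoder output. A minimal sketch, assuming both are already available as arrays (the GPT-4 prompting itself is omitted):

```python
# Sketch of step 4's grading: correlate GPT-4's simulated feature weights with
# the actual weights taken from the autoencoder's encoder output.
import numpy as np
from scipy.stats import pearsonr

def explanation_score(simulated: np.ndarray, actual: np.ndarray) -> float:
    r, _p = pearsonr(simulated, actual)
    return float(r)

# e.g. explanation_score(np.array([0., 3., 1., 0.]), np.array([0.1, 2.7, 0.9, 0.0]))
```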
Model Training
An overseer, denoted by O, is imbued with a ‘utility table’ U: a mapping from words to their respective utility values. O converts a tokenized generation to words, and then computes the utility of the generation and prefix together. The components of the setup are:
Utility Designation: Each word w has an associated utility value U(w). For example:
Word | Utility
--- | ---
Happy | 4
Sad | -3
Satisfied | 3
Angry | -3
Overseer (O): A script that converts a tokenized sequence to words and takes a sum of their corresponding utility values in accordance with a utility table U.
Student Model (MRLHF): The model undergoing fine-tuning, shaped by feedback from the overseer.
State (s): Symbolizes a prompt or input directed to MRLHF.
Action (a): Denotes the response generated by MRLHF corresponding to state s.
Reward Mechanism: For a generated action a consisting of tokens $t_1, t_2, \ldots, t_n$ with corresponding words $w_1, w_2, \ldots, w_n$, the reward is calculated as $\mathrm{Reward}(a) = \sum_{i=1}^{n} U(w_i)$. As is common in RLHF, we train the policy model to maximize reward while penalizing the KL-divergence of its generations from the reference base model. Here, $\pi_\theta(a|s)$ denotes the policy of MRLHF, parameterized by θ, giving the probability of generating action a in state s.
The utility values used in U were extracted from the VADER lexicon, which contains sentiment values assigned by human annotators ranging from −4 (extremely negative) to 4 (extremely positive), averaged over ten annotations per word. We assigned reward to a sentence as the sum of its words' utilities, scaled down by a factor of 5 and clamped to the interval [−10, 10]. The scaling and clipping constants were chosen empirically to keep the RLHF tuning from diverging due to high rewards.
$$\mathrm{Reward}(s) = \mathrm{clip}\left(\tfrac{1}{5}\sum_{\mathrm{token} \in s} U(\mathrm{token}),\ -10,\ +10\right)$$
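A minimal sketch of the overseer's reward computation corresponding to the formula above. Loading the VADER lexicon through NLTK and splitting generations on whitespace are assumptions about tooling, not a description of our exact pipeline:

```python
# Sketch of the overseer O: sum per-word VADER utilities over a generation,
# scale down by 5, and clip to [-10, 10]. NLTK loading and whitespace word
# splitting are illustrative assumptions.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
U = SentimentIntensityAnalyzer().lexicon  # word -> mean annotator rating in [-4, 4]

def reward(generation: str) -> float:
    total = sum(U.get(word, 0.0) for word in generation.lower().split())
    return max(-10.0, min(10.0, total / 5.0))

print(reward("happy and satisfied but angry"))
```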
Results and Discussion
Fine-tuning (mostly arbitrarily) on the IMDb reviews dataset to create MRLHF, we use GPT-4 to assign descriptions to features, and then compute the absolute utility of the top-k most similar feature descriptions as a proxy for reward modeling efficacy (a sketch of one way to compute this follows below). The idea is that a model that better encapsulates U should represent more features relevant to it. As an example, we compared this fine-tune of Pythia-410m (trained in accordance with the setup described above) to the base model. With a top-k value of 30, we found that MBASE scored 58.5 on this metric, whereas MRLHF scored 80.6. This pattern held for the 70m and 160m variants, with base and fine-tuned scores of 58.1 and 90.9, and 43.4 and 68.1, respectively. This is a fairly primitive metric, especially given that our autoencoders aren't necessarily capturing a representative sample of features with a sample size of 150 features, and that feature weightings could easily counteract a lack of representation of features with high-utility descriptions. Future experiments might weight utilities by average feature activations over a corpus of inputs to account for this.
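For reference, the proxy metric can be sketched as follows, under the assumption that a description's absolute utility is the summed absolute utility of its words (the exact tokenization of descriptions is an implementation detail not specified here):

```python
# Sketch of the proxy metric: total absolute utility of the words appearing in
# the top-k GPT-4 feature descriptions. Splitting descriptions on whitespace is
# an assumption; U is the word -> utility mapping of the utility table.
def description_utility_score(descriptions: list[str], U: dict[str, float]) -> float:
    return sum(
        abs(U.get(word, 0.0))
        for desc in descriptions
        for word in desc.lower().split()
    )

# e.g. description_utility_score(top_k_descriptions, U)
```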
We also fine-tuned Pythia-70m toward positive-sentiment completions on the same dataset, scored by a DistilBERT sentiment classifier trained to convergence; the reward is the logit of the positive-sentiment label (a sketch follows below). We used the method described at the beginning of the post to get feature descriptions for the top-k = 10 features for each layer.
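A sketch of how such a reward can be computed; the checkpoint name below is a publicly available SST-2 DistilBERT stand-in, not the classifier trained for this experiment:

```python
# Sketch: reward a completion with the positive-sentiment logit from a
# DistilBERT classifier. The checkpoint name is a stand-in, not the exact
# classifier used in the experiment.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
classifier = AutoModelForSequenceClassification.from_pretrained(name)

def sentiment_reward(completion: str) -> float:
    inputs = tokenizer(completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = classifier(**inputs).logits  # shape (1, num_labels)
    return logits[0, classifier.config.label2id["POSITIVE"]].item()

print(sentiment_reward("I loved every minute of this film."))
```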
The set of features identified as detecting opinions about movies is itself a good example of both the utility and the shortcomings of this method. Detecting the occurrence of an opinion about a movie is plausible given the training objective of generating positive-sentiment completions, but the description is very high-level and overrepresented among the feature descriptions. In the fine-tuned Pythia-70m instance, of the 50 highest-similarity features (10 per high-divergence layer), 21 feature descriptions mention detecting opinions or reviews in the context of movies. Of the top-k = 10 features in layer 4 of the fine-tuned model, 8 serve this purpose. Contrast this with the base model, where 13 feature descriptions in total focus on sentiment in the context of movie reviews.
This data alone does not allow a clear picture of the reward model to be constructed. Although in this limited sample a greater portion of the features clearly represent concepts related to the training objective, it cannot be shown that the model has properly internalized the reward model on which it was trained. Additionally, it is highly improbable that the base model inherently devotes 13 of the 50 sampled features to identifying opinions on movies, which suggests that the nature of the input data used to sample activations can skew GPT-4's description of a feature. For example, if a feature consistently activates on negative opinions and the entire sample set is movie reviews, it may be unclear to GPT-4 whether the feature is responding to negative sentiment generally or to negative sentiment in movie reviews specifically. In the future, more diverse datasets will be used to account for this. Here are some example features from layer 2 of the fine-tuned Pythia-70m instance, which are likely not all monosemantic, but interesting nonetheless:
Feature Index in Dictionary | GPT-4 Description
--- | ---
99 | activating for hyphenated or broken-up words or sequences within the text data.
39 | recognizing and activating for named entities, particularly proper names of people and titles in the text.
506 | looking for expressions related to movie reviews or comments about movies.
377 | looking for noun phrases or entities in the text as it seems to activate for proper nouns, abstract concepts, and possibly structured data.
62 | looking for instances where names of people or characters, potentially those related to films or novels, are mentioned in the text.
428 | looking for instances of movie or TV show titles and possibly related commentary or reviews.
433 | identifying the start of sentences or distinct phrases, as all the examples feature a non-zero activation at the beginning of the sentences.
406 | looking for broken or incomplete words in the text, often indicated by a space or special character appearing within the word.
148 | identifying and activating for film-related content and reviews.
We’re actively pursuing this line of work. As an example of the kind of experiment we’re interested in running, we are considering setups such as training the encoder to compress activations from MBASE while the decoder reconstructs the corresponding activations from MRLHF under the same inputs, so that we procure a dictionary of feature differences in place of likely ground-truth features (a speculative sketch follows below). There seems to be plenty of room for experimentation in the optimal use-cases for sparse coding generally, as well as in understanding learned reward models. We’re currently working towards a paper with much greater experimental depth, and if sparse coding for reward models interests you, please reach out over LessWrong for a discussion.
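A speculative sketch of that setup, purely to illustrate the idea (architecture and loss are placeholders, not a committed design):

```python
# Speculative sketch of the proposed "difference dictionary" setup: encode
# activations from the base model, but train the decoder to reconstruct the
# RLHF model's activations on the same inputs. All details are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModelSparseAutoencoder(nn.Module):
    """Encode M_BASE activations; decode toward M_RLHF activations on the same inputs."""
    def __init__(self, activation_dim: int, dict_size: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim, bias=False)
        self.l1_coeff = l1_coeff

    def loss(self, base_acts: torch.Tensor, rlhf_acts: torch.Tensor) -> torch.Tensor:
        codes = F.relu(self.encoder(base_acts))   # compress base-model activations
        recon = self.decoder(codes)               # ...but reconstruct the RLHF model's
        return F.mse_loss(recon, rlhf_acts) + self.l1_coeff * codes.abs().mean()

# The decoder's columns would then be read as a dictionary of feature
# *differences* between the two models, rather than of likely ground-truth features.
```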
[1] Given by $\mathrm{MMCS}(D, D') = \frac{1}{|D|}\sum_{d \in D} \max_{d' \in D'} \mathrm{CosineSim}(d, d')$, where D and D′ are learned dictionaries. $D_g$ denotes the top-k features of D that contribute most to the MMCS. In the case of LLMs, the ground truth features are unknown, so the set $D_g$ is used as a proxy for a true representation of the ground truth features.