Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

This research was performed by Luke Marks, Amirali Abdullah, nothoughtsheadempty and Rauno Arike. Special thanks to Fazl Barez from Apart Research for overseeing the project and contributing greatly to direction and oversight throughout. We'd also like to thank Logan Riggs for feedback and suggestions regarding autoencoder architecture and experiment design.

Introduction

Sparse Autoencoders Find Highly Interpretable Directions in Language Models showed that sparse coding achieves SOTA performance in making features interpretable using OpenAI's method of automated interpretability. We briefly tried to extend these results to reward models learned during RLHF in Pythia-70m/410m. Our method can be summarized as follows:

1. Identify layers L in MRLHF, a language model fine-tuned through RLHF, that are likely involved in reward modeling. We do so by sorting layers in order of decreasing parameter divergence from the base model under the Euclidean norm (a sketch of this ranking follows the list). To simplify notation, the succeeding steps describe our feature extraction for a single fixed layer ℓ of L.
2. For both MRLHF and a base model MBASE, train two autoencoders AE1 and AE2 of differing hidden sizes with the same sparsity constraint. These autoencoders reconstruct activation vectors at ℓ for their respective model. For each model, we extract a pair of lower-dimensional feature dictionaries D1 and D2 from the corresponding autoencoders, where each feature is a column of the decoder's weight matrix.
3. Because autoencoders produce varying dictionaries across training runs and hyperparameters, we keep only the features that occur in both D1 and D2. We compute the Max Cosine Similarity (MCS) between features in D1 and D2 to identify features that repeat across the two dictionaries, on the grounds that shared features are more likely to be genuine features of the model. The Mean Max Cosine Similarity (MMCS)[1] serves as an aggregate measure of the quality of our extracted features (see the dictionary-comparison sketch after this list).
4. The top-k most similar features between D1 and D2 in terms of MCS are explained using GPT-4, following the method detailed here and originally here. We feed the encoder of AEn activations from the model on which it was trained, and have GPT-4 predict a description of a feature from the feature weights in the encoder output. GPT-4 then simulates weights for that feature as if the predicted description were true. The Pearson correlation coefficient between the simulated and actual weights serves as a score for the accuracy of the description (see the scoring sketch after this list).
5. By explicitly comparing these explanations between MRLHF and MBASE, we investigate a case study related to reward modeling, showing how these descriptions can be correlated with reward modeling efficacy.
6. We apply this method to a training regime in which MRLHF is fine-tuned with RLHF via proximal policy optimization to learn an explicit table of words and maximize their presence in its generations. This training environment allows us to quantitatively assess the efficacy of MRLHF's reward model.
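To make step 1 concrete, here is a minimal sketch of ranking layers by parameter divergence between the base and fine-tuned checkpoints under the Euclidean norm. The fine-tuned checkpoint path and the grouping of parameters by the "layers.N." prefix are illustrative assumptions, not our exact script.

```python
# Minimal sketch (not our exact script): rank layers of the fine-tuned model by
# the Euclidean norm of their parameter difference from the base model.
from collections import defaultdict
from transformers import GPTNeoXForCausalLM

base = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m")
tuned = GPTNeoXForCausalLM.from_pretrained("path/to/pythia-70m-rlhf")  # hypothetical checkpoint

def rank_layers_by_divergence(base_model, tuned_model):
    tuned_params = dict(tuned_model.named_parameters())
    sq_dist = defaultdict(float)
    for name, p_base in base_model.named_parameters():
        if "layers." not in name:
            continue  # skip embeddings, final layer norm, and unembedding
        layer_idx = int(name.split("layers.")[1].split(".")[0])
        diff = tuned_params[name].detach() - p_base.detach()
        sq_dist[layer_idx] += diff.pow(2).sum().item()
    # Sort layers by decreasing Euclidean (L2) divergence.
    return sorted(((i, d ** 0.5) for i, d in sq_dist.items()),
                  key=lambda pair: pair[1], reverse=True)

print(rank_layers_by_divergence(base, tuned))
```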
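Steps 2 and 3 reduce to training sparse autoencoders on a layer's activations and comparing the dictionaries read off their decoders. The sketch below shows a generic sparse autoencoder and the MCS/MMCS computation from footnote [1]; the architecture and L1 sparsity penalty are assumptions, since the post does not pin down those details.

```python
# Sketch of steps 2-3: a generic sparse autoencoder (architectural details are
# assumptions) and the MCS/MMCS comparison between two learned dictionaries.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Generic sparse autoencoder over a layer's activation vectors."""
    def __init__(self, activation_dim: int, dict_size: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim, bias=False)
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        codes = F.relu(self.encoder(x))       # sparse feature coefficients
        recon = self.decoder(codes)
        loss = F.mse_loss(recon, x) + self.l1_coeff * codes.abs().mean()
        return recon, codes, loss

def dictionary(ae: SparseAutoencoder) -> torch.Tensor:
    # Each feature is a column of the decoder's weight matrix; return the
    # features as unit-norm rows of shape (dict_size, activation_dim).
    return F.normalize(ae.decoder.weight, dim=0).T

def mcs_and_mmcs(D1: torch.Tensor, D2: torch.Tensor):
    # Max cosine similarity of each feature in D1 against all of D2, and the
    # mean over D1 (the MMCS of footnote [1]).
    sims = F.normalize(D1, dim=1) @ F.normalize(D2, dim=1).T
    mcs = sims.max(dim=1).values
    return mcs, mcs.mean()

# Usage with two trained autoencoders of differing hidden sizes:
# mcs, mmcs = mcs_and_mmcs(dictionary(ae1), dictionary(ae2))
```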
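The grading in step 4 is just a Pearson correlation between the weights GPT-4 simulates under its predicted description and the actual weights from the encoder output. A minimal sketch, assuming both are already available as arrays (the GPT-4 prompting itself is omitted):

```python
# Sketch of step 4's grading: correlate GPT-4's simulated feature weights with
# the actual weights taken from the autoencoder's encoder output.
import numpy as np
from scipy.stats import pearsonr

def explanation_score(simulated: np.ndarray, actual: np.ndarray) -> float:
    r, _p = pearsonr(simulated, actual)
    return float(r)

# e.g. explanation_score(np.array([0., 3., 1., 0.]), np.array([0.1, 2.7, 0.9, 0.0]))
```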
Model Training
An overseer, denoted by O, is imbued with a ‘utility table’ U: a mapping from words to their respective utility values. O converts a tokenized generation to words, and then computes the utility of the generation and prefix together. The components of the setup are:
Utility Designation: Each word w has an associated utility value U(w). For example:
Word | Utility
--- | ---
Happy | 4
Sad | -3
Satisfied | 3
Angry | -3
Overseer (O): A script that converts a tokenized sequence to words and takes a sum of their corresponding utility values in accordance with a utility table U.
Student Model (MRLHF): The model undergoing fine-tuning, shaped by feedback from the overseer.
State (s): Symbolizes a prompt or input directed to MRLHF.
Action (a): Denotes the response generated by MRLHF corresponding to state s.
Reward Mechanism: For a generated action a consisting of tokens $t_1, t_2, \ldots, t_n$ with corresponding words $w_1, w_2, \ldots, w_n$, the reward is calculated as $\mathrm{Reward}(a) = \sum_{i=1}^{n} U(w_i)$. As is common in RLHF, we train the policy model to maximize reward while penalizing the KL-divergence of its generations from the reference base model. Here, $\pi_\theta(a|s)$ denotes the policy of MRLHF, parameterized by θ, giving the probability of generating action a in state s.
The utility values used in U were extracted from the VADER lexicon, which contains sentiment values assigned by human annotators ranging from −4 (extremely negative) to 4 (extremely positive), averaged over ten annotations per word. We assigned reward to a sentence as the sum of its words' utilities, scaled down by a factor of 5 and clamped to the interval [−10, 10]. The scaling and clipping constants were chosen empirically to keep the RLHF tuning from diverging due to high rewards.
$$\mathrm{Reward}(s) = \mathrm{clip}\left(\tfrac{1}{5}\sum_{\mathrm{token} \in s} U(\mathrm{token}),\ -10,\ +10\right)$$
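A minimal sketch of the overseer's reward computation corresponding to the formula above. Loading the VADER lexicon through NLTK and splitting generations on whitespace are assumptions about tooling, not a description of our exact pipeline:

```python
# Sketch of the overseer O: sum per-word VADER utilities over a generation,
# scale down by 5, and clip to [-10, 10]. NLTK loading and whitespace word
# splitting are illustrative assumptions.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
U = SentimentIntensityAnalyzer().lexicon  # word -> mean annotator rating in [-4, 4]

def reward(generation: str) -> float:
    total = sum(U.get(word, 0.0) for word in generation.lower().split())
    return max(-10.0, min(10.0, total / 5.0))

print(reward("happy and satisfied but angry"))
```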
Results and Discussion
Fine-tuning (mostly arbitrarily) on the IMDb reviews dataset to create MRLHF, we use GPT-4 to assign descriptions to features, and then compute the absolute utility of the top-k most similar feature descriptions as a proxy for reward modeling efficacy (a sketch of one way to compute this follows below). The idea is that a model that better encapsulates U should represent more features relevant to it. As an example, we compared this fine-tune of Pythia-410m (trained in accordance with the setup described above) to the base model. With a top-k value of 30, we found that MBASE scored 58.5 on this metric, whereas MRLHF scored 80.6. This pattern held for the 70m and 160m variants, with base and fine-tuned scores of 58.1 and 90.9, and 43.4 and 68.1, respectively. This is a fairly primitive metric, especially given that our autoencoders aren't necessarily capturing a representative sample of features with a sample size of 150 features, and that feature weightings could easily counteract a lack of representation of features with high-utility descriptions. Future experiments might weight utilities by average feature activations over a corpus of inputs to account for this.
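For reference, the proxy metric can be sketched as follows, under the assumption that a description's absolute utility is the summed absolute utility of its words (the exact tokenization of descriptions is an implementation detail not specified here):

```python
# Sketch of the proxy metric: total absolute utility of the words appearing in
# the top-k GPT-4 feature descriptions. Splitting descriptions on whitespace is
# an assumption; U is the word -> utility mapping of the utility table.
def description_utility_score(descriptions: list[str], U: dict[str, float]) -> float:
    return sum(
        abs(U.get(word, 0.0))
        for desc in descriptions
        for word in desc.lower().split()
    )

# e.g. description_utility_score(top_k_descriptions, U)
```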
We also fine-tuned Pythia-70m toward positive-sentiment completions on the same dataset, scored by a DistilBERT sentiment classifier trained to convergence; the reward is the logit of the positive-sentiment label (a sketch follows below). We used the method described at the beginning of the post to get feature descriptions for the top-k = 10 features for each layer.
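A sketch of how such a reward can be computed; the checkpoint name below is a publicly available SST-2 DistilBERT stand-in, not the classifier trained for this experiment:

```python
# Sketch: reward a completion with the positive-sentiment logit from a
# DistilBERT classifier. The checkpoint name is a stand-in, not the exact
# classifier used in the experiment.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
classifier = AutoModelForSequenceClassification.from_pretrained(name)

def sentiment_reward(completion: str) -> float:
    inputs = tokenizer(completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = classifier(**inputs).logits  # shape (1, num_labels)
    return logits[0, classifier.config.label2id["POSITIVE"]].item()

print(sentiment_reward("I loved every minute of this film."))
```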
The set of features identified as detecting opinions about movies is itself a good example of both the utility and the shortcomings of this method. Detecting the occurrence of an opinion about a movie is plausible given the training objective of generating positive-sentiment completions, but the description is very high-level and overrepresented among the feature descriptions. In the fine-tuned Pythia-70m instance, of the 50 highest-similarity features (10 per high-divergence layer), 21 feature descriptions mention detecting opinions or reviews in the context of movies. Of the top-k = 10 features in layer 4 of the fine-tuned model, 8 serve this purpose. Contrast this with the base model, where 13 feature descriptions in total focus on sentiment in the context of movie reviews.
This data alone does not allow a clear picture of the reward model to be constructed. Although in this limited sample a greater portion of the features clearly represent concepts related to the training objective, it cannot be shown that the model has properly internalized the reward model on which it was trained. Additionally, it is highly improbable that the base model inherently devotes 13 of the 50 sampled features to identifying opinions on movies, which suggests that the nature of the input data used to sample activations can skew GPT-4's description of a feature. For example, if a feature consistently activates on negative opinions and the entire sample set is movie reviews, it may be unclear to GPT-4 whether the feature is responding to negative sentiment generally or to negative sentiment in movie reviews specifically. In the future, more diverse datasets will be used to account for this. Here are some example features from layer 2 of the fine-tuned Pythia-70m instance, which are likely not all monosemantic, but interesting nonetheless:
Feature Index in Dictionary | GPT-4 Description
--- | ---
99 | activating for hyphenated or broken-up words or sequences within the text data.
39 | recognizing and activating for named entities, particularly proper names of people and titles in the text.
506 | looking for expressions related to movie reviews or comments about movies.
377 | looking for noun phrases or entities in the text as it seems to activate for proper nouns, abstract concepts, and possibly structured data.
62 | looking for instances where names of people or characters, potentially those related to films or novels, are mentioned in the text.
428 | looking for instances of movie or TV show titles and possibly related commentary or reviews.
433 | identifying the start of sentences or distinct phrases, as all the examples feature a non-zero activation at the beginning of the sentences.
406 | looking for broken or incomplete words in the text, often indicated by a space or special character appearing within the word.
148 | identifying and activating for film-related content and reviews.
We’re actively pursuing this line of work. As an example of the kind of experiment we’re interested in running, we are considering setups such as training the encoder to compress activations from MBASE while the decoder reconstructs the corresponding activations from MRLHF under the same inputs, so that we procure a dictionary of feature differences in place of likely ground-truth features (a speculative sketch follows below). There seems to be plenty of room for experimentation in the optimal use-cases for sparse coding generally, as well as in understanding learned reward models. We’re currently working towards a paper with much greater experimental depth, and if sparse coding for reward models interests you, please reach out over LessWrong for a discussion.
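A speculative sketch of that setup, purely to illustrate the idea (architecture and loss are placeholders, not a committed design):

```python
# Speculative sketch of the proposed "difference dictionary" setup: encode
# activations from the base model, but train the decoder to reconstruct the
# RLHF model's activations on the same inputs. All details are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModelSparseAutoencoder(nn.Module):
    """Encode M_BASE activations; decode toward M_RLHF activations on the same inputs."""
    def __init__(self, activation_dim: int, dict_size: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim, bias=False)
        self.l1_coeff = l1_coeff

    def loss(self, base_acts: torch.Tensor, rlhf_acts: torch.Tensor) -> torch.Tensor:
        codes = F.relu(self.encoder(base_acts))   # compress base-model activations
        recon = self.decoder(codes)               # ...but reconstruct the RLHF model's
        return F.mse_loss(recon, rlhf_acts) + self.l1_coeff * codes.abs().mean()

# The decoder's columns would then be read as a dictionary of feature
# *differences* between the two models, rather than of likely ground-truth features.
```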
[1] Given by $\mathrm{MMCS}(D, D') = \frac{1}{|D|}\sum_{d \in D} \max_{d' \in D'} \mathrm{CosineSim}(d, d')$, where D and D′ are learned dictionaries. $D_g$ denotes the top-k features of D that contribute most to the MMCS. In the case of LLMs, the ground truth features are unknown, so the set $D_g$ is used as a proxy for a true representation of the ground truth features.