Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs

This is my first post on the platform and my first set of experiments with GPT-2 using TransformerLens. If you spot any interesting insights or mistakes, feel free to share your thoughts in the comments. While these findings aren’t entirely novel and may seem trivial, I’m presenting them here as a reference for anyone exploring this topic for the first time!

All the code, with some extra analysis [not included in this post], is available here.

Introduction and Motivation
Fine-tuning large language models (LLMs) is widely used to adapt models to specific tasks, yet a fundamental question remains: what actually changes in the model’s internal representations? Prior research suggests that fine-tuning induces significant behavioral shifts despite minimal weight changes. This tension raises an important question: if the weight updates are small, what aspects of the model’s internal structure drive these drastic changes?
A particularly relevant study, “Refusal in Language Models Is Mediated by a Single Direction” by Arditi et al., explored how refusal behavior in LLMs can be decomposed into interpretable activation patterns. Inspired by their approach, I sought to investigate whether sentiment-based fine-tuning also results in a distinct “sentiment direction” in embedding space, and whether this direction can be meaningfully analyzed.
Key Contributions and Findings
To systematically explore this, I fine-tuned GPT-2 on the IMDB sentiment classification dataset and conducted several analyses to understand how fine-tuning alters embedding space and model activations. Specifically, I:
Derived a “sentiment direction” from the fine-tuned model and examined its alignment with the baseline model.
Applied a causal-mediation-style layer replacement analysis, swapping layers of the baseline model with their counterparts from the fine-tuned model, to test how different components contribute to the observed changes.
Transferred hidden layer activations from the fine-tuned model into the LM head of the baseline GPT-2, testing whether sentiment information remains decodable.
Tracked token shifts in embedding space, identifying which words experience the most significant positional changes post-fine-tuning.
Compared the norm of the sentiment direction across baseline and fine-tuned models, revealing notable structural shifts.
Why This Matters for AI Alignment and Interpretability
These findings contribute to ongoing discussions in LLM interpretability, model editing, and activation steering. If fine-tuning reliably introduces latent feature directions in a model’s activation space, this raises the possibility of targeted behavioral interventions—e.g., modifying models without retraining by directly adjusting their activation patterns.
However, the analysis also presents challenges. Comparing the sentiment direction norms across different fine-tuned versions introduces potential methodological concerns—are these shifts genuinely meaningful, or are they artifacts of the analysis? Further investigation is needed to disentangle causal changes in representation space from mere surface-level alignment shifts.
Analysis before finetuning
The sentiment direction is computed by measuring the difference between the mean activation vectors for positive and negative sentiment samples at the final token position. The Euclidean norm of this difference vector is then used to quantify the strength of sentiment separation at each layer.
As expected, sentiment isn’t captured in a single neuron but emerges as a structured pattern across layers. The sentiment direction norm helps us quantify where and how strongly a model differentiates between positive and negative text. We measure how the model’s internal representation shifts based on sentiment by extracting residual stream activations [at the last token position] at the start of each layer, before that layer’s attention and MLP transformations. The difference vector between mean activations for positive and negative inputs reveals a sentiment-specific direction in activation space, and its L2 norm tells us the magnitude of that shift.
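To make the computation concrete, here is a minimal sketch using TransformerLens; `pos_texts` and `neg_texts` are tiny stand-ins for the IMDb samples, not my actual data pipeline:

```python
# Hedged sketch: per-layer sentiment-direction norm with TransformerLens.
# pos_texts / neg_texts are illustrative stand-ins for the IMDb samples.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

pos_texts = ["An absolutely wonderful film, I loved every minute of it."]
neg_texts = ["A dull, badly acted movie that I regretted watching."]

def mean_last_token_resid(texts, layer):
    """Mean residual-stream activation at the last token, before block `layer`."""
    acts = []
    for text in texts:
        _, cache = model.run_with_cache(model.to_tokens(text))
        acts.append(cache["resid_pre", layer][0, -1])  # shape: [d_model]
    return torch.stack(acts).mean(dim=0)

norms = []
for layer in range(model.cfg.n_layers):
    direction = mean_last_token_resid(pos_texts, layer) - mean_last_token_resid(neg_texts, layer)
    norms.append(direction.norm().item())  # L2 norm = strength of sentiment separation
print(norms)
```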
Early layers (low norms) barely separate sentiment, focusing on syntactic structures, while later layers (higher norms) increasingly specialize in sentiment-based distinctions. The steadily rising values, peaking in the deepest layers, suggest that the model refines and amplifies sentiment-related information as it processes inputs, likely making final layers most useful for tasks like sentiment classification.
Finetuning
The fine-tuning process for GPT-2 on IMDb reviews follows a causal language modeling (CLM) objective, meaning the model learns to predict the next token given the previous tokens. The loss function used in fine-tuning is the cross-entropy loss, which measures how well the model predicts each token in the sequence. Since labels are identical to input IDs in causal language modeling (tokenized["labels"] = tokenized["input_ids"].copy()), the model is trained in a self-supervised manner, adjusting its weights to improve token prediction based on sentiment-labeled reviews.
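For reference, here is a hedged sketch of this setup with the Hugging Face `Trainer`; the prompt format, sequence length, and hyperparameters are illustrative assumptions rather than the exact values I used:

```python
# Sketch of CLM fine-tuning of GPT-2 on IMDb with sentiment-labeled prompts.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

imdb = load_dataset("imdb", split="train")

def tokenize(batch):
    # Assumed prompt format: the sentiment label appears in the text itself.
    texts = [f"Review: {t}\nSentiment: {'positive' if l == 1 else 'negative'}"
             for t, l in zip(batch["text"], batch["label"])]
    tokenized = tokenizer(texts, truncation=True, padding="max_length", max_length=256)
    tokenized["labels"] = tokenized["input_ids"].copy()  # CLM: labels mirror the inputs
    return tokenized  # (for simplicity, padding tokens are not masked out of the loss here)

train_ds = imdb.map(tokenize, batched=True, remove_columns=imdb.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-imdb", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_ds,
)
trainer.train()
```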
The evaluation of the base GPT-2 model and the fine-tuned model on a subset of 200 IMDb test reviews reveals a significant improvement in sentiment classification accuracy after fine-tuning. The base model, which was not explicitly trained for sentiment classification, achieves only 45.5% accuracy, which is surprising [given it is below the 50% random-chance baseline].
However, after fine-tuning on IMDb reviews with sentiment-labeled prompts, the model’s accuracy jumps to 96.0%, demonstrating that it has learned to effectively distinguish between positive and negative sentiment. This dramatic performance gain suggests that fine-tuning successfully aligned the model’s residual stream activations with sentiment distinctions.
Post-finetune analysis
We observe signs of slight overfitting beginning at epoch 3, as indicated by a drop in generalization despite achieving 94% accuracy. Therefore, unless explicitly stated otherwise, all analyses and references to the “fine-tuned model” in this work refer to the epoch 2 checkpoint, where the model achieved its peak accuracy of 96%.
t-SNE of last-layer representations
We start with analyzing the last hidden state activations for positive and negative samples in the base and the fine-tuned model. Although we don’t find any distinct clusters, we observe that the representation space has shifted.
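A sketch of how such a projection can be produced, assuming the activations have already been collected as in the earlier TransformerLens snippet (but from `resid_post` at the final layer); `acts` and `labels` are assumed arrays:

```python
# Hedged sketch: 2D t-SNE of last-layer, last-token activations.
# `acts` is an assumed [n_samples, d_model] array; `labels` holds 0/1 sentiment.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Note: t-SNE perplexity must be smaller than the number of samples.
proj = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(acts)
plt.scatter(proj[:, 0], proj[:, 1], c=labels, cmap="coolwarm", s=10)
plt.title("t-SNE of last-layer activations")
plt.show()
```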
Sentiment direction norm across epochs
Next, we compare a base GPT-2 model to versions fine-tuned for one, two, and three epochs on IMDb sentiment data, measuring the sentiment direction norm at each layer.
1563 - model after 1st epoch of finetuning
3126 - model after 2nd epoch of finetuning
4686 - model after 3rd epoch of finetuning
Before fine-tuning, GPT-2 barely separates sentiment in early layers, with the distinction growing in deeper layers. After one epoch, sentiment separation increases significantly across all layers, starting from the early layers and peaking in the final layers where task-specific information is encoded. By epoch two, the separation continues to improve, but by epoch three the gains plateau, suggesting the model has already learned most of what it can about sentiment.
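A sketch of how the per-layer norms can be recomputed for each checkpoint; the checkpoint paths mirror the step numbers above but are placeholders, and `sentiment_direction_norms` is the per-layer computation from the earlier sketch wrapped as a function:

```python
# Hedged sketch: sentiment-direction norms per layer for each epoch checkpoint.
from transformers import AutoModelForCausalLM
from transformer_lens import HookedTransformer

checkpoints = {"base": None,
               "epoch1": "gpt2-imdb/checkpoint-1563",
               "epoch2": "gpt2-imdb/checkpoint-3126",
               "epoch3": "gpt2-imdb/checkpoint-4686"}  # placeholder paths

norms_by_checkpoint = {}
for name, path in checkpoints.items():
    hf_model = AutoModelForCausalLM.from_pretrained(path) if path else None
    # Load the fine-tuned weights into TransformerLens so activations can be cached.
    model = HookedTransformer.from_pretrained("gpt2", hf_model=hf_model)
    norms_by_checkpoint[name] = sentiment_direction_norms(model)  # assumed wrapper
```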
PCA of base and finetuned version
Each dot in the PCA plot represents the 2D projection of a single IMDb review’s last-token activation at a specific transformer layer, capturing how the model processes sentiment information at different depths. Blue dots correspond to positive reviews, red dots to negative ones, and their spread in PCA space reveals whether sentiment is well-separated or entangled in that layer. These activations come from the residual stream before any transformation by attention or feedforward layers, meaning they reflect the raw information available at each step. If positive and negative activations remain mixed, the layer does not strongly encode sentiment, but if they separate, it means sentiment information has been structured into the model’s representation space. The green arrow, representing the sentiment direction, shows the axis along which sentiment shifts; since it grows in deeper layers, we can confirm that sentiment processing happens progressively throughout the model.
Base model PCA per layer
Finetuned model PCA per layer
Each subplot corresponds to a specific transformer layer, starting from Layer 0 (top-left) to Layer 11 (bottom-right). The left side represents the base GPT-2 model, while the right side represents the fine-tuned GPT-2 model. To interpret the changes, compare each layer in the base model (left) with the corresponding layer in the fine-tuned model (right)—this will reveal how sentiment representations evolve due to fine-tuning.
One of the main reasons the green arrow (sentiment direction) appears to change direction across layers and models is that PCA dynamically selects the most significant axes of variance for each dataset separately. Since PCA is applied independently to each layer, the principal components (PC1, PC2) in one layer are not necessarily aligned with those in another layer. This means that even if the actual sentiment difference in high-dimensional space remains the same, its projection in PCA space can appear rotated. The same issue applies when comparing the base vs. fine-tuned model—because fine-tuning modifies the structure of the representation space, PCA finds new dominant axes of variation for each model. As a result, the sentiment direction vector may point in a different direction even if the underlying separation remains the same. This is purely a change in basis, not a fundamental shift in how sentiment is encoded. Therefore, instead of interpreting the absolute orientation of the green arrow, the key insight lies in how much the sentiment separation grows across layers and models, which is reflected in the length of the arrow [and also in the heat map] rather than its direction.
Beyond the effects of PCA, finetuning itself reshapes how sentiment information is stored in the model’s internal activations. In the base model, sentiment may be encoded more diffusely and inseparably, spread across multiple dimensions, making it harder to project cleanly and distinctively into a single plane. This could explain why, in some layers of the fine-tuned model, we observe the left cluster to be more separated than the corresponding left-side cluster in the base model.
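A sketch of how these per-layer panels are produced; `layer_acts(model, texts, layer)` is an assumed helper returning an [n_samples, d_model] array of last-token `resid_pre` activations:

```python
# Hedged sketch: per-layer PCA of last-token activations with the projected
# sentiment direction drawn as a green arrow.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

fig, axes = plt.subplots(3, 4, figsize=(16, 10))
for layer, ax in enumerate(axes.flat):
    pos = layer_acts(model, pos_texts, layer)   # assumed helper
    neg = layer_acts(model, neg_texts, layer)
    pca = PCA(n_components=2).fit(np.concatenate([pos, neg]))
    pos2d, neg2d = pca.transform(pos), pca.transform(neg)
    ax.scatter(pos2d[:, 0], pos2d[:, 1], c="blue", s=8, label="positive")
    ax.scatter(neg2d[:, 0], neg2d[:, 1], c="red", s=8, label="negative")
    # Sentiment direction = difference of class means, in this layer's PCA basis.
    arrow = pos2d.mean(axis=0) - neg2d.mean(axis=0)
    ax.quiver(*neg2d.mean(axis=0), *arrow, angles="xy", scale_units="xy",
              scale=1, color="green")
    ax.set_title(f"Layer {layer}")
plt.tight_layout()
plt.show()
```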
Visualizing the token embeddings across models
In the visualization, we first look at a mix of neutral, general words along with a few positive and negative sentiment words, allowing us to compare how different types of words shift in the embedding space before and after fine-tuning.
In the base GPT-2 model, the embeddings appear to be clearly arranged, showing a structured pattern in how words are distributed. However, this structure is not due to positional embeddings, as they are not included in the static token embeddings but are added dynamically during the forward pass of the model. Instead, this pattern likely arises from the way GPT-2’s pre-trained embedding space is organized, possibly reflecting general semantic relationships between words. After fine-tuning, the embedding space appears significantly altered, as the model has adapted its token representations to better distinguish sentiment. The structured arrangement seen in the base model is replaced by a more scattered and compressed space, where sentiment words shift positions in a way that prioritizes sentiment encoding. Interestingly, while sentiment-related words have moved closer together, the fine-tuned embeddings still place positive and negative sentiment words near each other, suggesting that fine-tuning has reorganized the representation but not in a way that creates neatly separated sentiment clusters. This indicates that sentiment information is likely being redistributed across higher-dimensional spaces, making it less interpretable in a simple 2D projection like t-SNE.
Base model
Finetuned model
Please feel free to zoom in and inspect more closely!
Full-vocabulary embeddings t-SNE plot for the base [left] and fine-tuned [right] models
We also observe the mean shift in positive and negative embeddings.
POSITIVE sentiment words: 0.0623
NEGATIVE sentiment words: 0.0703
Most affected tokens
These results show that fine-tuning has significantly reshaped the embedding space for specific tokens. Notably, tokens with strong sentiment or offensive language—such as :::spoiler “fucking”, “fucked”, “fuckin”, and “pissed” :::, are among the most affected, indicating that the fine-tuning process has likely recontextualized these words to capture sentiment distinctions more accurately. Additionally, tokens like “nigerian”, “canon”, and even some with unusual character sequences like “âģ¦” and “âģķ” exhibit substantial shifts, which could either be due to encoding artifacts or because these tokens are infrequent in the pre-trained model and thus more malleable during fine-tuning. Overall, these changes suggest that fine-tuning not only enhances sentiment-specific representations but also affects how both common and rare tokens are embedded, potentially redistributing their positions in the embedding space to better align with the task-specific nuances present in the sentiment-labeled data.
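A sketch of how these shifts can be measured; the checkpoint path and the sentiment word lists are illustrative assumptions:

```python
# Hedged sketch: per-token embedding shift between base and fine-tuned GPT-2,
# plus the mean shift over small hand-picked sentiment word lists.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
ft = AutoModelForCausalLM.from_pretrained("gpt2-imdb/checkpoint-3126")  # placeholder path

W_base = base.get_input_embeddings().weight.detach()
W_ft = ft.get_input_embeddings().weight.detach()
shift = (W_ft - W_base).norm(dim=1)              # L2 shift for every vocabulary token

top = torch.topk(shift, k=20)                    # most affected tokens
print([(tokenizer.decode([i]), round(v.item(), 4)) for i, v in zip(top.indices, top.values)])

def mean_shift(words):
    ids = [tokenizer.encode(" " + w)[0] for w in words]   # leading space = GPT-2 word token
    return shift[ids].mean().item()

print("POSITIVE:", mean_shift(["great", "wonderful", "amazing"]))   # illustrative lists
print("NEGATIVE:", mean_shift(["terrible", "awful", "boring"]))
```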
Squishing of the output space

We observed that fine-tuning has a pronounced effect on the output space of the model. While the base GPT-2 model exhibits a relatively broad and diverse output distribution, as reflected in moderate entropy and low perplexity, the fine-tuned model shows signs of “squishing” or compressing this space. In practical terms, the fine-tuned model generates output vectors with reduced magnitude, which is evident from the norm ratios (mostly at or below 1) when comparing fine-tuned outputs to those of the base model.
This compression appears to reallocate probability mass, suppressing non-sentiment tokens in favor of those related to sentiment. However, this suppression comes at a cost: the overall performance on neutral inputs deteriorates, as indicated by a dramatic rise in perplexity. Moreover, while the sentiment direction remains largely preserved (high cosine similarity), the final hidden states exhibit a significant reorientation (mean cosine similarity near −0.623), suggesting that fine-tuning has both compressed and reshaped the latent space to prioritize sentiment-specific features.
| Model | Avg Entropy | Avg Log Prob (non-sentiment) | Avg Perplexity |
|---|---|---|---|
| Base Model | 4.3660 | -16.1626 | 160.87 |
| Finetuned Model | 4.7328 | -15.4430 | 49,802.45 |
These values indicate that while the fine-tuned model’s output distribution has marginally higher entropy and assigns a slightly less negative log probability to non-sentiment tokens, the overall perplexity increases drastically. This suggests that fine-tuning has led to a more focused (or compressed) output space where non-sentiment words are effectively suppressed, potentially redistributing probability mass in a way that has impaired general language modeling while enhancing sentiment-specific processing.
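A sketch of how these three metrics can be computed; the definition of “non-sentiment” tokens and the neutral evaluation texts are assumptions about my setup:

```python
# Hedged sketch: average next-token entropy, average log-prob of non-sentiment
# tokens, and perplexity on neutral text for a given model.
import torch
import torch.nn.functional as F

@torch.no_grad()
def output_space_metrics(model, tokenizer, neutral_texts, sentiment_token_ids):
    entropies, nonsent_logps, nlls = [], [], []
    nonsent_mask = None
    for text in neutral_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        logits = model(ids).logits[0, :-1]              # predictions for tokens 1..n
        logp = F.log_softmax(logits, dim=-1)
        entropies.append(-(logp.exp() * logp).sum(-1).mean())   # avg next-token entropy
        if nonsent_mask is None:
            nonsent_mask = torch.ones(logp.shape[-1], dtype=torch.bool)
            nonsent_mask[sentiment_token_ids] = False
        nonsent_logps.append(logp[:, nonsent_mask].mean())      # avg log-prob, non-sentiment vocab
        nlls.append(F.cross_entropy(logits, ids[0, 1:]))        # per-token NLL
    perplexity = torch.exp(torch.stack(nlls).mean())
    return (torch.stack(entropies).mean().item(),
            torch.stack(nonsent_logps).mean().item(),
            perplexity.item())
```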
Probing layers for sentiment potency
We probe each layer of the fine-tuned model by extracting its hidden activations and sequentially plugging them into the base model’s LM head to generate logits. By applying softmax to these logits, we obtain sentiment predictions at each layer, allowing us to evaluate where in the network sentiment information is most strongly encoded. The results show that early layers contain almost no sentiment information, while sentiment separation emerges in the middle layers (9-10) and is fully captured in the final layers (11-12). This confirms that fine-tuning redistributes representational focus, concentrating sentiment information in the deeper layers, which aligns with our earlier findings on output space compression.
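A sketch of the probing step; the prompt format and the choice of " positive"/" negative" as label tokens are assumptions about my setup, and `hf_finetuned` stands in for the fine-tuned Hugging Face checkpoint:

```python
# Hedged sketch: route the fine-tuned model's layer-l residual stream through
# the base model's final LayerNorm + unembedding ("LM head") and compare the
# logits of the two sentiment label tokens.
import torch
from transformer_lens import HookedTransformer

base = HookedTransformer.from_pretrained("gpt2")
ft = HookedTransformer.from_pretrained("gpt2", hf_model=hf_finetuned)  # assumed checkpoint

pos_id = base.to_single_token(" positive")
neg_id = base.to_single_token(" negative")

@torch.no_grad()
def probe_layer(text, layer):
    _, cache = ft.run_with_cache(ft.to_tokens(text))
    resid = cache["resid_post", layer][:, -1:, :]       # fine-tuned activations, last token
    logits = base.unembed(base.ln_final(resid))         # base model's LM head
    return "positive" if logits[0, -1, pos_id] > logits[0, -1, neg_id] else "negative"
```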
To better understand where sentiment information is encoded in base GPT-2, we perform a layer replacement experiment, where we systematically replace one layer at a time in the base model with its corresponding layer from the fine-tuned model and measure the impact on sentiment classification accuracy. The results reveal a clear pattern: lower layers (1-6) have a minimal effect, meaning they mostly encode general linguistic structures rather than sentiment. However, accuracy jumps significantly when replacing middle layers (7-9), peaking at 63% in Layer 7, indicating that sentiment representations emerge most strongly in these layers.
Replacing the final layers (10-12) does not provide additional gains, and Layer 12 reduces accuracy, suggesting that fine-tuning starts compressing sentiment information from the middle layers. This aligns with our previous findings on output space compression, reinforcing the idea that fine-tuning restructures representations by reallocating sentiment-specific features to key middle layers while maintaining a more general processing structure in earlier and later layers.
We also observe that layer 7 is a point of inflection [or rather, the point of sentiment emergence]: in the probing experiment, accuracy starts rising from this layer, while in the layer-replacement experiment, accuracy peaks here and declines for later layers.
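A sketch of the replacement procedure; the checkpoint path and the `sentiment_accuracy` helper are assumptions:

```python
# Hedged sketch: swap one transformer block at a time from the fine-tuned model
# into a fresh copy of the base model, then re-run the sentiment evaluation.
import copy
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")
ft = AutoModelForCausalLM.from_pretrained("gpt2-imdb/checkpoint-3126")  # placeholder path

results = {}
for layer_idx in range(base.config.n_layer):            # 12 blocks in GPT-2 small
    hybrid = copy.deepcopy(base)
    hybrid.transformer.h[layer_idx].load_state_dict(ft.transformer.h[layer_idx].state_dict())
    results[layer_idx + 1] = sentiment_accuracy(hybrid)  # assumed evaluation helper
print(results)
```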
Conclusion
The analysis shows that fine-tuning does not simply overwrite pre-trained representations but substantially reconfigures them, redistributing sentiment-specific information across layers and squishing the output representation space in favor of sentiment-related tokens. The sentiment direction norm becomes more pronounced in middle-to-late layers, suggesting that fine-tuning refines existing structures rather than creating entirely new ones. These findings reinforce the idea that fine-tuning is a process of constraint and specialization rather than wholesale transformation.
However, several open questions remain. While sentiment direction provides useful interpretability insights, its stability across datasets and architectures needs further validation. The shifting nature of embedding space alignment also raises concerns about how many of these observed effects are intrinsic vs. artifacts of dimensionality reduction. Additionally, the interventions remain largely correlational—future work could explore causal modifications to activations, analyze scaling effects in larger models, and test whether other fine-tuning objectives induce similar transformations.