After seeing this comment, I think that if I were to rewrite this post it would have been better to use the KL divergence rather than the simple ΔCE metric I used. I think they're subtly different.
Per the TL implementation for CE, I'm calculating $\mathrm{CE}_j = -\frac{1}{N}\sum_i \ln p_{ij}$, where $i$ is the batch dimension, $j$ is the context position, and $p_{ij}$ is the probability assigned to the correct next token.
So $\Delta\mathrm{CE}_j = \mathrm{CE}_j^{\text{patched}} - \mathrm{CE}_j^{\text{baseline}} = \frac{1}{N}\sum_i \left(\ln p_{ij} - \ln q_{ij}\right)$, where $p_{ij}$ is the baseline probability and $q_{ij}$ is the patched probability.
So this is missing a factor of $p_{ij}$ to be the true KL divergence.
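For concreteness, this is roughly the computation I mean. It's a minimal sketch in plain PyTorch rather than the actual TransformerLens code, and the tensors are dummy stand-ins for a real clean/patched forward pass:

```python
import torch
import torch.nn.functional as F

def per_position_ce(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Negative log prob of the correct next token; shape [batch, pos - 1]."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Next-token prediction: the logits at position j are scored against token j + 1.
    pred_log_probs = log_probs[:, :-1].gather(-1, tokens[:, 1:, None]).squeeze(-1)
    return -pred_log_probs

# Dummy shapes standing in for a real run: [batch, pos, d_vocab] logits, [batch, pos] tokens.
batch, pos, d_vocab = 4, 16, 50257
tokens = torch.randint(0, d_vocab, (batch, pos))
clean_logits = torch.randn(batch, pos, d_vocab)    # baseline forward pass
patched_logits = torch.randn(batch, pos, d_vocab)  # forward pass with the patched activation

ce_clean = per_position_ce(clean_logits, tokens)             # CE_j terms under the baseline
ce_patched = per_position_ce(patched_logits, tokens)         # CE_j terms after patching
delta_ce_per_position = (ce_patched - ce_clean).mean(dim=0)  # ΔCE_j, averaged over the batch
```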
I think it is the same. When training next-token predictors, we model the ground-truth probability distribution as having probability 1 for the actual next token and 0 for every other token in the vocab. This is how the cross-entropy loss simplifies to the negative log likelihood. You can see that the TransformerLens implementation doesn't match the full equation for cross-entropy loss because it uses this simplification.
So the missing factor of $p_{ij}$ would just be 1, I think.
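To illustrate that simplification with made-up numbers (a sketch, not anything from the post): cross-entropy against a one-hot target collapses to the negative log probability of the single target token.

```python
import torch
import torch.nn.functional as F

# Made-up logits over a tiny 5-token vocab; token 2 plays the role of the actual next token.
logits = torch.tensor([[2.0, -1.0, 0.5, 0.3, -0.7]])
target = torch.tensor([2])

# Full cross-entropy H(p, q) = -sum_x p(x) ln q(x), with p one-hot on the target ...
one_hot = F.one_hot(target, num_classes=logits.shape[-1]).float()
full_ce = -(one_hot * F.log_softmax(logits, dim=-1)).sum(dim=-1)

# ... collapses to the negative log likelihood of the target token alone.
nll = -F.log_softmax(logits, dim=-1)[0, target]

print(full_ce.item(), nll.item(), F.cross_entropy(logits, target).item())  # all identical
```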
Oh! You’re right, thanks for walking me through that, I hadn’t appreciated that subtlety. Then in response to the first question: yep! ΔCE = KL Divergence.
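Writing that out explicitly, on my reading, with $P_{\text{gt}}$ the one-hot ground-truth distribution and $x^{*}$ the actual next token:

$$
D_{\mathrm{KL}}\!\left(P_{\text{gt}} \,\middle\|\, Q\right)
= \sum_{x} P_{\text{gt}}(x) \ln \frac{P_{\text{gt}}(x)}{Q(x)}
= 1 \cdot \ln \frac{1}{Q(x^{*})}
= -\ln Q(x^{*}),
$$

which is exactly the per-token cross-entropy term under the one-hot simplification described above.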