I think it is the same. When training next-token predictors we model the ground-truth probability distribution as having probability 1 for the actual next token and 0 for every other token in the vocab. Because that target is one-hot, the sum over the vocabulary in the cross-entropy collapses to a single term, which is how the cross-entropy loss simplifies to the negative log likelihood of the actual next token. You can see that the TransformerLens implementation doesn’t match the textbook equation for cross-entropy loss because it uses this simplification.
So the missing factor of p would just be 1 I think.
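For concreteness, here's a minimal sketch of that simplification (PyTorch assumed; the shapes and the next-token shift are illustrative, not the exact TransformerLens code). Writing out the full cross-entropy against a one-hot target gives exactly the negative log probability of the actual next token:

```python
import torch
import torch.nn.functional as F

# Toy shapes (illustrative): batch of 2 sequences, 6 positions, vocab of 10.
batch, seq, vocab = 2, 6, 10
logits = torch.randn(batch, seq, vocab)
tokens = torch.randint(0, vocab, (batch, seq))

# Next-token prediction: logits at position t are scored against token t+1.
log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
targets = tokens[:, 1:]

# Full cross-entropy, -sum_i p_i * log q_i, with p the one-hot ground truth.
p_one_hot = F.one_hot(targets, num_classes=vocab).float()
full_ce = -(p_one_hot * log_probs).sum(dim=-1)

# Simplified form: every p_i except the actual next token is 0, so the sum
# collapses to a single term -- the negative log prob of that token.
nll = -log_probs.gather(dim=-1, index=targets[..., None]).squeeze(-1)

assert torch.allclose(full_ce, nll)
print(full_ce.mean(), nll.mean())  # identical mean loss either way
```

The gather version is the form you typically see in next-token loss code, which is why no explicit factor of p ever appears.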
Oh! You’re right, thanks for walking me through that, I hadn’t appreciated that subtlety. Then in response to the first question: yep! ΔCE = KL Divergence.
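The identity underneath this: CE(p, q) = H(p) + KL(p‖q), and for a one-hot p the entropy term vanishes, so the cross-entropy against the one-hot target is itself the KL divergence from it. A quick numerical check (PyTorch again, toy numbers chosen just for illustration):

```python
import torch
import torch.nn.functional as F

vocab = 10
log_q = F.log_softmax(torch.randn(vocab), dim=-1)          # model distribution q (log space)
p = F.one_hot(torch.tensor(3), num_classes=vocab).float()  # one-hot ground truth p

ce = -(p * log_q).sum()                     # cross-entropy H(p, q)
entropy_p = -torch.xlogy(p, p).sum()        # H(p) = 0 for a one-hot p (xlogy(0, 0) = 0)
kl = (torch.xlogy(p, p) - p * log_q).sum()  # KL(p || q) = sum_i p_i * (log p_i - log q_i)

assert torch.isclose(entropy_p, torch.tensor(0.0))
assert torch.isclose(ce, kl)                # CE = H(p) + KL = KL when H(p) = 0
```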