Lucius Bushnaq comments on Lucius Bushnaq’s Shortform

Lucius Bushnaq 6 Sep 2024 5:22 UTC
2 points
The metric you mention here is probably ‘loss recovered’. For an SAE inserted into the residual stream, it is
1 - (CE loss with SAE - CE loss of original model) / (CE loss if the entire residual stream is ablated - CE loss of original model)

See e.g. equation 5 here.

So, it’s a linear scale: they’re comparing the CE loss increase from inserting the SAE to the CE loss increase from just destroying the model so that it outputs a ≈ uniform distribution over tokens. The latter is a very large CE loss increase, so the denominator is really big. Thus, scoring over 90% is pretty easy.
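
To make the size of the denominator concrete, here is a rough sketch of the calculation in Python. All the numbers are made up for illustration: the vocabulary size, the clean CE loss, and the CE loss with the SAE inserted are hypothetical, and I'm assuming the ablated model outputs an exactly uniform distribution, so that its CE loss is ln(vocab size) in nats. The function name `loss_recovered` is just my label for the formula above, not an API from any library.

```python
import math


def loss_recovered(ce_sae: float, ce_clean: float, ce_ablated: float) -> float:
    """1 - (CE with SAE - clean CE) / (ablated CE - clean CE)."""
    denominator = ce_ablated - ce_clean
    if denominator <= 0:
        raise ValueError("ablated CE loss must exceed the clean CE loss")
    return 1.0 - (ce_sae - ce_clean) / denominator


# Hypothetical numbers, chosen only to illustrate the scale of the metric.
vocab_size = 50_000                # assumed vocabulary size
ce_clean = 3.0                     # assumed CE loss of the original model (nats)
ce_sae = 3.5                       # assumed CE loss with the SAE spliced in (nats)

# Assumption: full ablation yields an exactly uniform distribution over tokens,
# whose cross-entropy is ln(vocab_size) regardless of the true next token.
ce_ablated = math.log(vocab_size)

score = loss_recovered(ce_sae, ce_clean, ce_ablated)
print(f"ablated CE: {ce_ablated:.2f} nats, loss recovered: {score:.3f}")
```

With these made-up numbers, the SAE adds 0.5 nats of CE loss, yet the score still comes out at roughly 0.94, because the denominator is close to 8 nats.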