I’ve found that too. Taking log(L0) and log(MSE) both seem reasonable to me, but it feels weird to me to take log(DownstreamLoss) for cross-entropy losses, since that’s already log-ish. In my case the plots were generally worse to look at than the ones I showed above when scanning over a very broad range of L1 coefficients (and therefore L0 values).
I usually look at log(downstream loss - original LM loss), i.e. the log of the extra loss you get from splicing in the SAE. But more broadly, there’s nothing wrong with looking at the log of some LM-loss-based term; all the scaling laws stuff does it.
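In case it helps, here is a minimal sketch of the quantity I mean. The function name and all the numbers are made up for illustration (they are not from any real run), and I'm assuming losses in nats per token. The only subtlety is that the gap can be zero or negative, in which case the log is undefined and the point is skipped.

```python
import math

def log_excess_loss(downstream_loss, original_lm_loss):
    """Return log(downstream_loss - original_lm_loss), or None if the gap is not positive."""
    gap = downstream_loss - original_lm_loss
    if gap <= 0:
        return None  # log is undefined here; leave the point off the plot
    return math.log(gap)

# Hypothetical values, purely illustrative.
original_lm_loss = 3.30
sweep = [(12.0, 3.95), (45.0, 3.52), (160.0, 3.36)]  # (L0, downstream loss)

# (log L0, log excess loss) pairs, ready to scatter-plot.
points = [(math.log(l0), log_excess_loss(loss, original_lm_loss)) for l0, loss in sweep]
```

Plotting these against log(L0) gives the log-log view, with the original LM loss subtracted out so the curve isn't squashed against a floor.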