I usually look at log(downstream loss − original LM loss). More broadly, there’s nothing wrong with taking the log of some LM-loss-based term; all the scaling-laws work does it.
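For concreteness, here is a minimal sketch of that quantity. The function and argument names are made up for illustration, and it assumes both losses are plain scalars (say, mean cross-entropy) with the downstream loss strictly larger than the original LM loss, since the log is undefined otherwise.

```python
import math


def log_loss_gap(downstream_loss: float, original_lm_loss: float) -> float:
    """Return log(downstream_loss - original_lm_loss).

    Hypothetical helper: both arguments are assumed to be scalar losses
    measured in the same units.
    """
    gap = downstream_loss - original_lm_loss
    if gap <= 0:
        # The log of a non-positive gap is undefined, so fail loudly.
        raise ValueError("downstream loss must exceed the original LM loss")
    return math.log(gap)
```

If the gap can be zero or negative in practice, some other treatment is needed (for example clipping), and that choice is not covered here.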