I usually look at log(downstream loss − original LM loss). More broadly, there’s nothing wrong with looking at the log of some LM-loss-based term; all the scaling-laws work does it.
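For concreteness, here is a minimal sketch of that quantity. It assumes a natural log and that both losses are scalar averages in the same units (e.g. nats per token). The function name and the example numbers are made up for illustration, not taken from any particular codebase or experiment.

```python
import math


def log_loss_gap(downstream_loss: float, original_lm_loss: float) -> float:
    """Return log(downstream_loss - original_lm_loss) using the natural log.

    Hypothetical helper: the name and signature are illustrative only.
    """
    gap = downstream_loss - original_lm_loss
    # The log is only defined when the downstream loss exceeds the original.
    if gap <= 0:
        raise ValueError("downstream loss must exceed original LM loss")
    return math.log(gap)


# Illustrative numbers only, not real measurements.
print(log_loss_gap(3.5, 3.0))
```

The guard matters in practice: if the downstream loss ever matches or beats the original, the gap is non-positive and the log is undefined, so that case has to be handled separately.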