I realize that unit-type-checking ML is pretty uncommon and might just be insane
Nah, it’s a great trick.
The two parameter distances seem to be measured in whatever metric you’re using for parameter space, which seems like a very different kind of quantity from the logprobs.
The trick here is that L2 regularization / weight decay is equivalent to placing a Gaussian prior on the parameters, so you can think of that term as $\log \mathcal{N}(\theta; \theta_0, \sigma)$ (minus an irrelevant additive constant), where $\sigma$ is chosen to match whatever weight-decay hyperparameter you used.
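Spelling out the correspondence (this is the standard identity, not anything specific to this thread):

$$-\log \mathcal{N}(\theta;\, \theta_0, \sigma) = \frac{\lVert \theta - \theta_0 \rVert^2}{2\sigma^2} + \text{const},$$

so adding a weight-decay penalty $\lambda \lVert \theta - \theta_0 \rVert^2$ to the loss is the same as MAP inference under this prior with $\sigma^2 = 1/(2\lambda)$. Note that the $1/(2\sigma^2)$ factor is also what makes the exponent unitless, which is relevant to the unit-checking point below.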
This does mean that you are committing to a Gaussian prior over the parameters. If you wanted to include additional information like “moving towards zero is more likely to be good” then you would not have a Gaussian centered at $\theta_0$, and so the corresponding log prob would not be the nice simple “L2 distance to $\theta_0$”.
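As a concrete illustration of that point (my example, not from the original comment): a prior $\mathcal{N}(\alpha \theta_0, \sigma)$ with $0 < \alpha < 1$ encodes a pull toward zero, and its log prob is $-\lVert \theta - \alpha \theta_0 \rVert^2 / (2\sigma^2)$ plus a constant, i.e. an L2 distance to the shrunken point $\alpha \theta_0$ rather than to $\theta_0$ itself.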
My admittedly-weak physics intuitions say that you only want to take an exponential (and certainly a log-sum-exp like this) of unitless quantities, but this one looks like it maybe has the units of our distance in parameter space. That makes it weird to integrate over possible parameters, which introduces another unit of parameter space, and then take the logarithm of the result.
I think this intuition is correct, and the typical solution in ML algorithms is to empirically scale all of your quantities such that everything works out (which you can interpret from the unit-checking perspective as “finding the appropriate constant to multiply your quantities by such that they become the right kind of unitless”).
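As a minimal sketch of what that empirical scaling can look like in practice (the function name and the mean-squared-distance heuristic are my assumptions, not something from the thread):

```python
import numpy as np

def logsumexp_over_scaled_distances(sq_dists):
    """Log-sum-exp of -d^2/scale over squared parameter distances.

    sq_dists has units of [parameter]^2; dividing by a scale with
    the same units makes the exponent unitless before exp().
    """
    # Empirical scale: the mean squared distance plays the role of
    # 2*sigma^2 in the Gaussian-prior reading. This is one common
    # heuristic (cf. the median trick for kernel bandwidths), not
    # anything canonical.
    scale = np.mean(sq_dists)
    z = -sq_dists / scale       # unitless exponents
    m = np.max(z)               # subtract the max for numerical stability
    return m + np.log(np.sum(np.exp(z - m)))

# Raw magnitudes like these would underflow exp() without the scaling.
sq_dists = np.array([1e4, 2e4, 3e4])
print(logsumexp_over_scaled_distances(sq_dists))
```

The division by `scale` is exactly the “multiply by a constant until it’s the right kind of unitless” move, with the Gaussian prior’s $2\sigma^2$ being the principled version of that constant.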