I feel overall confused, but I think that’s mostly because of me missing some relevant background to your thinking, and the preliminary/draft nature of this.
I hope sharing my confusions is useful to you. Here they are:
I’m not sure how the process of “spending bits” works. If the space of possible models were finite and discretized, then you could say spending B bits is partitioning down to a 1/2^B fraction of the space—but this is not at all how SGD works, and it seems incompatible with using SGD (or any optimizer that doesn’t ‘teleport’ through parameter space) as the optimization algorithm.
Spending bits does make sense in terms of naive rejection sampling (though I think we agree that would be intractably expensive) and in other cases of discrete optimization, like integer programming. It’s possible I would be less confused if this were explained using a different optimization algorithm, like BFGS or some Hessian-based method, or maybe a black-box Bayesian solver.
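To make concrete the finite/discretized picture I have in mind, here is a toy sketch of my own (not anything from your draft; the hypothesis count and the constraint predicate are made up):

```python
import math

# Hypothetical finite "model space": each model is just an integer label.
hypotheses = list(range(1024))

def bits_spent(before, after):
    """Bits 'spent' by a filtering step = -log2 of the surviving fraction."""
    return -math.log2(len(after) / len(before))

# A toy constraint that keeps 1/4 of the space. In naive rejection sampling
# you would instead resample until the constraint held, costing ~2^B attempts
# for B bits, which is the intractability mentioned above.
survivors = [h for h in hypotheses if h % 4 == 0]

print(bits_spent(hypotheses, survivors))  # 2.0
```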
Separately, I’m not sure why the two heads wouldn’t just end up being identical to each other. Under shorter-program-length priors (which seem reasonable in this case; likewise minimum-description-length, sparse-factor-graph, etc.), it seems like weight-tying the two heads, or otherwise making them identical, would be favored.
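Concretely, this is the degenerate weight-tied solution I would expect such a prior to favor (a minimal PyTorch-style sketch of my own; the class name and dimensions are made up):

```python
import torch.nn as nn

class TwoHeadedModel(nn.Module):
    """Toy two-headed model where the heads are tied, i.e. literally the same module."""
    def __init__(self, d_model: int, d_out: int):
        super().__init__()
        self.body = nn.Linear(d_model, d_model)  # stand-in for the shared trunk
        self.head1 = nn.Linear(d_model, d_out)
        # Tying: the second head adds no description length and can never
        # disagree with the first.
        self.head2 = self.head1

    def forward(self, x):
        z = self.body(x)
        return self.head1(z), self.head2(z)
```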
Lastly, I think I’m confused by your big formula for the unnormalized posterior log probability of (θ1, θ2) -- I think the most accessible of my confusions is that it doesn’t seem to pass “basic type checking consistency”.
I know the output should be a log probability, so all the added components should be logprobs/in terms of bits.
The L() term makes sense, since it’s given in terms of bits.
The two parameter distances seem like they’re in whatever distance metric you’re using for parameter space, which seems to be very different from the logprobs. Maybe they both just have some implicit unit-conversion parameter out front, but I’d be surprised if every “1 parameter unit” move through parameter space were worth “1 nat” of information. For example, it’s intuitive to me that some directions (towards zero, say) would be more likely than others.
The C() term has a Lagrange multiplier, which I think is usually unitless. In this case I think it’s safe to say it’s also maybe doing unit conversion. C() itself seems to possibly be in terms of bits/nats, but that isn’t clear.
In normal Lagrangian constrained optimization, lambda would be the parameter that gives us the resource tradeoff: how many bits of loss (L) on the dataset trade off against a single bit of inconsistency (C).
Finally, the integral is a bit tricky for me to follow. My admittedly-weak physics intuitions say that you only want to take an exponential (and certainly a log-sum-exp like this one) of unitless quantities, but it looks like the exponent maybe has the units of our distance in parameter space. That makes it weird to integrate over possible parameters, which introduces another unit of parameter space, and then take the logarithm of the result.
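For reference, this is the schematic shape I’m reading the expression as having; it is just my paraphrase of the terms named above, annotated with the units I think each carries, and not your actual formula:

```latex
\underbrace{-\,L(\theta_1)}_{\text{bits/nats}}
\;\underbrace{-\,\|\theta_1-\theta_0\|^2 \;-\; \|\theta_2-\theta_0\|^2}_{(\text{parameter units})^2}
\;\underbrace{-\,\lambda\, C(\theta_1,\theta_2)}_{\text{units unclear}}
\;+\;\underbrace{\log \int e^{-(\cdots)}\, d\theta}_{\text{log of a parameter-space volume?}}
```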
(I realize that unit-type-checking ML is pretty uncommon and might just be insane, but it’s one of the ways I try to figure out what’s going on in various algorithms)
Looking forward to reading more about this in the future.
I realize that unit-type-checking ML is pretty uncommon and might just be insane
Nah, it’s a great trick.
The two parameter distances seem like they’re in whatever distance metric you’re using for parameter space, which seems to be very different from the logprobs.
The trick here is that L2 regularization / weight decay is equivalent to having a Gaussian prior on the parameters, so you can think of that term as log N(θ ∣ θ0, σ²) (minus an irrelevant additive constant), where σ is whatever value your weight-decay hyperparameter implies.
This does mean that you are committing to a Gaussian prior over the parameters. If you wanted to include additional information like “moving towards zero is more likely to be good” then you would not have a Gaussian centered at θ0, and so the corresponding log prob would not be the nice simple “L2 distance to θ0”.
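Spelled out (this is the standard Gaussian identity; I’m assuming an isotropic prior over d parameters for simplicity):

```latex
\log \mathcal{N}(\theta \mid \theta_0, \sigma^2 I)
  \;=\; -\frac{\|\theta-\theta_0\|^2}{2\sigma^2} \;-\; \frac{d}{2}\log\!\left(2\pi\sigma^2\right)
```

So the 1/(2σ²) in front of the squared distance is exactly the implicit unit-conversion factor you were looking for (squared parameter units to nats), and the second term is the irrelevant additive constant.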
My admittedly-weak physics intuitions are usually that you only want to take an exponential (or definitely a log-sum-exp like this) of unitless quantities, but it looks like it has the maybe the unit of our distance in parameter space. That makes it weird to integrate over possible parameter, which introduces another unit of parameter space, and then take the logarithm of it.
I think this intuition is correct, and the typical solution in ML algorithms is to empirically scale all of your quantities such that everything works out (which you can interpret from the unit-checking perspective as “finding the appropriate constant to multiply your quantities by such that they become the right kind of unitless”).
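Schematically (β here is just my notation for that empirical scale factor, not something from the draft):

```latex
\log \int \exp\!\big(-\beta\, E(\theta)\big)\, d\theta,
\qquad [\beta] = \frac{1}{[E(\theta)]}
```

The exponent is then unitless; the measure dθ does leave a log-volume behind, but changing your choice of parameter units only shifts the result by an additive constant, which drops out once you normalize or compare two such terms.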