This seems like a good time to bring up information temperature. After all, there is the deep parallel between entropy in information theory and entropy in physics. When comparing models, by what factor do you penalize a model for requiring more information to specify it? That would be analogous to the inverse temperature. I have yet to encounter a case where it makes sense in information theory, though.
Also, another explanation of the extra +1 is that the risk of having to use a −2 doesn’t seem that scary—it is not a very strong preference. If the penalty for a −2 were 10 while −1, 0, or +1 cost 1, then as long as the probability of needing to hit −2 to stay on the station is less than 11% and it saves a turn, going for the extra +1 seems like a good move. If the penalty is smaller (4, say), then an even fatter risk seems reasonable.
How is inverse temperature a penalty on models? If you’re referring to the inverse temperature in the Maxwell-Boltzmann distribution, the temperature is considered a constant, and it gives the likelihood of a particle having a particular configuration, not the likelihood of a distribution.
Also, I’m not sure it’s clear what you mean by “information to specify [a model]”. Does a high inverse temperature mean a model requires more information, because it’s more sensitive to small changes and therefore derives more information from them, or does it mean that the model requires less information, because it derives less information from inputs?
The entropy of the Maxwell-Boltzmann distribution is, I think, proportional to log-temperature, so high temperature (low sensitivity to inputs) is preferred if you go strictly by that. People who train neural networks generally do this as well to prevent overtraining, and they call it regularization.
If you are referring to the entropy of a model, you penalize a distribution for requiring more information by selecting the distribution that maximizes entropy subject to whatever invariants your model must abide by. This is typically done through the method of Lagrange multipliers.
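For concreteness, the textbook version of that maximization, with mean energy as the single invariant (just as an illustration; other constraints work the same way):

```latex
\max_{p}\; S = -\sum_i p_i \ln p_i
\quad\text{s.t.}\quad \sum_i p_i = 1, \qquad \sum_i p_i E_i = \langle E \rangle ,

\Lambda = -\sum_i p_i \ln p_i - \alpha\Big(\sum_i p_i - 1\Big) - \beta\Big(\sum_i p_i E_i - \langle E \rangle\Big),
\qquad
\frac{\partial \Lambda}{\partial p_i} = 0
\;\Rightarrow\;
p_i = e^{-(1+\alpha)}\, e^{-\beta E_i} \propto e^{-\beta E_i}.
```

The multiplier β attached to the energy constraint is the one that gets identified with 1/kT.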
You assign a probability to a microstate according to its energy and the temperature. The interplay of the density of states with temperature creates very nontrivial behavior (especially in solid-state systems).
You appear to know somewhat more about fitting than I do—as I understood it, you assign a probability to a specific model according to its information content and the ‘temperature’. As for the information content: if your model is a curve fit with four parameters, all of which are held to a narrow range, that has 1⁄3 more information than a fit with three parameters held to a similar range.
In pure information theory, the information requirement is in exact lockstep with the density of states. One bit per bit, no matter what. If you’re just picking out maximum entropy, then you don’t need to refer to a temperature.
I was thinking about a penalty-per-bit stronger than 1⁄2: a stronger preference for smaller models than break-even. Absolute zero would be when you don’t care about the evidence at all and just go with a 0-bit model.
It’s true that the probability of a microstate is determined by energy and temperature, but the Maxwell-Boltzmann equation assumes that temperature is constant for all particles. Temperature is a distinguishing feature of two distributions, not of two particles within a distribution, and least-temperature is not a state that systems tend towards.
As an aside, the canonical ensemble that the Maxwell-Boltzmann distribution assumes is only applicable when a given state is exceedingly unlikely to be occupied by multiple particles. The strange behavior of condensed matter that I think you’re referring to (Bose-Einstein condensates) is a consequence of this assumption being incorrect for bosons, where a stars-and-bars model is more appropriate.
It is not true that information theory requires the conservation of information. The Ising Model, for example, allows for particle systems with cycles of non-unity gain. This effectively means that it allows particles to act as amplifiers (or dampeners) of information, which is a clear violation of information conservation. This is the basis of critical phenomena, which is a widely accepted area of study within statistical mechanics.
I think you misunderstand how models are fit in practice. It is not standard practice to determine the absolute information content of input, then to relay that information to various explanators. The information content of input is determined relative to explanators. However, there are training methods that attempt to reduce the relative information transferred to explanators, and this practice is called regularization. The penalty-per-relative-bit approach is taken by a method called “dropout”, where a random “cold” model is trained on each training sample, and the final model is a “heated” aggregate of the cold models. “Heating” here just means cutting the amount of information transferred from input to explanator by some fraction.
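A minimal numpy sketch of that picture; the layer sizes and the 0.5 keep-probability are arbitrary choices for the sketch, and the test-time scaling is the usual shortcut for averaging the ‘cold’ sub-models rather than literally enumerating them:

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5                      # fraction of hidden units kept in each "cold" model

# one hidden layer; sizes are arbitrary for the sketch
W1 = rng.normal(size=(32, 64))
W2 = rng.normal(size=(64, 10))

def forward_cold(x):
    """Training-time pass: a randomly thinned ("cold") sub-network per sample."""
    h = np.maximum(x @ W1, 0.0)
    mask = rng.random(h.shape) < keep_prob   # drop each hidden unit independently
    return (h * mask) @ W2

def forward_heated(x):
    """Test-time pass: the "heated" aggregate, approximated by scaling the
    hidden activations by keep_prob instead of averaging every sub-network."""
    h = np.maximum(x @ W1, 0.0)
    return (h * keep_prob) @ W2

x = rng.normal(size=(4, 32))
print(forward_cold(x).shape, forward_heated(x).shape)
```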
It’s true that the probability of a microstate is determined by energy and temperature, but the Maxwell-Boltzmann equation assumes that temperature is constant for all particles. Temperature is a distinguishing feature of two distributions, not of two particles within a distribution, and least-temperature is not a state that systems tend towards.
I know. Models are not particles. They are distributions over outcomes. They CAN be the trivial distributions over outcomes (X will happen).
I was not referring to either form of degenerate gas in any of my posts here, and I’m not sure why I would give that impression. I also did not use any conservation of information, though I can see why you would think I did, when I spoke of the information requirement. I meant simply that if you add 1 bit of information, you have added 1 bit of entropy—as opposed to in a physical system, where the Fermi shell at, say, 10 meV can have much more or less entropy than the Fermi shell at 5 meV.
I thought you were referring to degenerate gases when you mentioned nontrivial behavior in solid state systems since that is the most obvious case where you get behavior that cannot be easily explained by the “obvious” model (the canonical ensemble). If you were thinking of something else, I’m curious to know what it was.
I’m having a hard time parsing your suggestion. The “dropout” method introduces entropy to “the model itself” (the conditional probabilities in the model), but it seems that’s not what you’re suggesting. You can also introduce entropy to the inputs, which is another common thing to do during training to make the model more robust. There’s no way to introduce 1 bit of entropy per “1 bit of information” contained in the input though since there’s no way to measure the amount of information contained in the input without already having a model of the input. I think systematically injecting noise into the input based on a given model is not functionally different from injecting noise into the model itself, at least not in the ideal case where the noise is injected evenly.
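For contrast, ‘introducing entropy to the inputs’ in the binary case usually looks like the sketch below; the 10% flip rate is an arbitrary knob, not something derived from a measured information content of the input, which is the point:

```python
import numpy as np

rng = np.random.default_rng(1)

def corrupt_inputs(x_bits, flip_prob=0.1):
    """Denoising-style input noise: independently flip each input bit with
    probability flip_prob before feeding it to the model."""
    flips = rng.random(x_bits.shape) < flip_prob
    return np.logical_xor(x_bits, flips).astype(x_bits.dtype)

x = rng.integers(0, 2, size=(4, 16))
print(x[0])
print(corrupt_inputs(x)[0])
```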
You said that “if you add 1 bit of information, you have added 1 bit of entropy”. I can’t tell if you’re equating the two phrases or if you’re suggesting adding 1 bit of entropy for every 1 bit of information. In either case, I don’t know what it means. Information and entropy are negations of one another, and the two have opposing effects on certainty-of-an-outcome. If you’re equating the two, then I suspect you’re referring to something specific that I’m not seeing. If you’re suggesting adding entropy for a given amount of information, it may help if you explain which probabilities are impacted. To which probabilities would you suggest adding entropy, and which probabilities have information added to them?
1) Any non-trivial density of states, especially in semiconductors, with their van Hove singularities.
2) I don’t mean a model like ‘consider an FCC lattice populated by one of 10 types of atoms. Here are the transition rates...’, where the model is made of microstates and you need to do statistics to get probabilities out. I mean a model more like ‘Each cigarette smoked increases the annual risk of lung cancer by 0.001%’, so that the output is naturally a distribution over outcomes. (Models like these include the lattice type as a special case.)
In particular, I’m working under the toy meta-model that models are programs that output a probability distribution over bitstreams; these are their predictions. You measure reality (producing some actual bitstream) and adjust the probability of each of the models according to the probability they gave for that bitstream, using Bayes’ theorem.
3) I may have misused the term. I mean the cost in entropy to produce that precise bitstream. Starting from a random bitstream, how many measurements do you have to use to turn it into, say, 1011011100101 with XOR operations? One for each bit. It doesn’t matter how many bits there are—you need to measure them all.
When you consider multiple models, you weight them as a function of their information, preferring shorter ones; a.k.a. Occam’s razor. Normally, you reduce the probability by 1⁄2 for each bit required: P_prior(model) ~ 2^-N, and you sum only up to the number of bits of evidence you have. This last clause is a bit of a hack to keep it normalizable (see below).
I drew a comparison of this to temperature, where you have a probability penalty of e^-E/kT on each microstate. You can have any value here because the number of microstates per energy range (the density of states) does not increase exponentially, but usually quadratically, or sometimes less (over short energy ranges, sometimes it is more).
If you follow the analogy back, the number of bitstreams does increase exponentially as a function of length (it doubles with each bit), so the prior probability penalty for length must be at least as strong as 1⁄2 per bit to avoid infinitely long programs being preferred. But you can use a stronger exponential die-off, say 2.01^(-N), and suddenly the distribution is already normalizable with no need for a special hack (see the toy sum below). Whatever particular value you put in there is your e^(1/kT) equivalent in the analogy.
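Here is a toy numerical version of that sum, treating ‘programs’ as raw bitstrings (only a stand-in for real program lengths):

```python
def total_prior_weight(penalty_per_bit, max_len=10_000):
    """Sum the prior over all bitstring 'programs': 2**n programs of length n,
    each with prior weight penalty_per_bit**n. Here 2**n plays the role of the
    density of states and penalty_per_bit**n is the Boltzmann-like factor."""
    return sum((2.0 * penalty_per_bit) ** n for n in range(max_len + 1))

# penalty of exactly 1/2: every length contributes total weight 1, so the sum
# just grows with max_len -- this is why the truncation hack is needed
print(total_prior_weight(1 / 2))      # 10001.0

# penalty of 1/2.01: a geometric series with ratio 2/2.01 < 1, so it converges
print(total_prior_weight(1 / 2.01))   # ~201.0
print(1 / (1 - 2 / 2.01))             # closed form of the infinite sum: 201.0
```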
2) I think this is the distinction you are trying to make between the lattice model and the smoker model: in the lattice model, the equations and parameters are defined, whereas in the smoker model, the equations and parameters have to be deduced. Is that right? If so, my previous posts were referring to the smoker-type model.
Your toy meta-model is consistent with what I was thinking when I used the word “model” in my previous comments.
3) I see what you’re saying. If you add complexity to the model, you want to make sure that its improvement in ability is greater than the amount of complexity added. You want to make sure that the model isn’t just “memorizing” the correct results, and that all model complexity comes with some benefit of generalizability.
I don’t think temperature is the right analogy. What you want is to penalize a model that is too generally applicable. Here is a simple case:
A one-hidden-layer feed-forward binary stochastic neural network, the goal of which is to find binary-vector representations of its binary-vector inputs. It translates its input to an internal representation of length n, then translates that internal representation into some binary-vector output that is the same length as its input. The error function is the reconstruction error, measured as the KL-divergence from input to output.
The “complexity” you want is the length of its internal representation, in bits, since each element of the internal representation can retain at most one bit of information, and that bit can reflect the input in an arbitrary way. The information loss is the same as the reconstruction error, in bits, since that describes the probability of the model guessing correctly on a given input stream (assuming each bit is independent). Your criterion translates to “minimize reconstruction error + internal representation size”, and this can be done by repeatedly increasing the size of the internal representation until adding one more element reduces reconstruction error by less than one bit.
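A rough sketch of that stopping rule; the synthetic data and the SVD-based error are only stand-ins for an actual trained binary autoencoder and its KL reconstruction error in bits:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 8 independent latent bits, each copied 4 times, so the
# 32 visible bits of each example really carry only about 8 bits.
latent = rng.integers(0, 2, size=(200, 8)).astype(float)
data = np.repeat(latent, 4, axis=1)

def reconstruction_error(n_hidden, x):
    """Stand-in for 'train an autoencoder with n_hidden internal elements and
    report its reconstruction error in bits': here, just the squared error of
    the best rank-n_hidden linear reconstruction."""
    centered = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    recon = (u[:, :n_hidden] * s[:n_hidden]) @ vt[:n_hidden]
    return float(np.sum((centered - recon) ** 2))

# Grow the internal representation until one more element buys < 1 bit.
n = 0
prev = reconstruction_error(n, data)
while True:
    err = reconstruction_error(n + 1, data)
    if prev - err < 1.0:
        break
    n, prev = n + 1, err

print("chosen internal representation size:", n)   # 8 for this synthetic data
```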
2) I think this is the distinction you are trying to make between the lattice model and the smoker model: in the lattice model, the equations and parameters are defined, whereas in the smoker model, the equations and parameters have to be deduced. Is that right? If so, my previous posts were referring to the smoker-type model.
Well, the real thing is that (again in the toy metamodel) you consider the complete ensemble of smoker-type models and let them fight it out for good scores when compared to the evidence. I guess you can consider this process to be deduction, sure.
3) (in response to the very end) That would be the point where 1 more bit of internal representation costs a factor of 1⁄2 in prior probability. If it were ‘minimize (reconstruction error + 2*representation size)’, that would be a ‘temperature’ half as large, where 1 more bit of internal representation costs a factor of 1⁄4 in prior probability. Colder thus corresponds to wanting your models smaller at the expense of accuracy. Sort of backwards from the usual way temperature is used in simulated annealing of MCMC systems.
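Spelling the correspondence out, with β as the coefficient on representation size N, and reading the reconstruction error in bits as −log2 P(data|model) (which is how your autoencoder case was set up):

```latex
P_{\text{prior}}(\text{model}) \propto 2^{-\beta N}
\;\Rightarrow\;
-\log_2\!\big[P_{\text{prior}}(\text{model})\,P(\text{data}\mid\text{model})\big]
= \beta N - \log_2 P(\text{data}\mid\text{model})
= \beta N + \text{(reconstruction error in bits)} .
```

So β = 1 is the factor-1⁄2 case, β = 2 is the factor-1⁄4 case, and β → ∞ is the ‘absolute zero’ of ignoring the evidence and keeping the 0-bit model.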
I see. You’re treating “energy” as the information required to specify a model. Your analogy and your earlier posts make sense now.