It’s true that the probability of a microstate is determined by energy and temperature, but the Maxwell-Boltzmann distribution assumes that temperature is the same for all particles. Temperature is a distinguishing feature of two distributions, not of two particles within a distribution, and minimum temperature is not a state that systems tend towards.
I know. Models are not particles. They are distributions over outcomes. They CAN be trivial distributions over outcomes (“X will happen”).
I was not referring to either form of degenerate gas in any of my posts here, and I’m not sure why I gave that impression. I also did not invoke any conservation of information, though I can see why you would think I did when I spoke of the information requirement. I meant simply that if you add 1 bit of information, you have added 1 bit of entropy, unlike in a physical system, where the Fermi shell at, say, 10 meV can have much more or much less entropy than the Fermi shell at 5 meV.
I thought you were referring to degenerate gases when you mentioned nontrivial behavior in solid-state systems, since that is the most obvious case where you get behavior that cannot easily be explained by the “obvious” model (the canonical ensemble). If you were thinking of something else, I’m curious to know what it was.
I’m having a hard time parsing your suggestion. The “dropout” method introduces entropy into “the model itself” (the conditional probabilities in the model), but it seems that’s not what you’re suggesting. You can also introduce entropy into the inputs, which is another common thing to do during training to make the model more robust. There’s no way to introduce 1 bit of entropy per “1 bit of information” contained in the input, though, since there’s no way to measure the amount of information contained in the input without already having a model of the input. I think systematically injecting noise into the input based on a given model is not functionally different from injecting noise into the model itself, at least not in the ideal case where the noise is injected evenly.
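To make the contrast concrete, here is a minimal numpy sketch of the two places you can inject entropy: into the model itself (dropout on the hidden activations) versus into the inputs (random bit flips). The layer sizes, rates, and function names are made up for illustration, not anyone’s actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W, dropout_p=0.0):
    """One hidden layer; dropout injects entropy into the model's own activations."""
    h = np.tanh(x @ W)
    if dropout_p > 0.0:
        keep = rng.random(h.shape) > dropout_p    # randomly silence hidden units
        h = h * keep / (1.0 - dropout_p)          # rescale so the expectation is unchanged
    return h

def corrupt_input(x, flip_p=0.1):
    """The other option: inject entropy into the inputs by flipping bits at random."""
    flips = rng.random(x.shape) < flip_p
    return np.where(flips, 1.0 - x, x)            # assumes binary (0/1) inputs

# Toy usage with made-up sizes.
x = rng.integers(0, 2, size=(4, 16)).astype(float)   # batch of binary inputs
W = rng.normal(size=(16, 8))
h_model_noise = forward(x, W, dropout_p=0.5)          # entropy added to the model
h_input_noise = forward(corrupt_input(x), W)          # entropy added to the inputs
```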
You said that “if you add 1 bit of information, you have added 1 bit of entropy”. I can’t tell if you’re equating the two phrases or if you’re suggesting adding 1 bit of entropy for every 1 bit of information. In either case, I don’t know what it means. Information and entropy are negations of one another, and the two have opposing effects on certainty-of-an-outcome. If you’re equating the two, then I suspect you’re referring to something specific that I’m not seeing. If you’re suggesting adding entropy for a given amount of information, it may help if you explain which probabilities are impacted. To which probabilities would you suggest adding entropy, and which probabilities have information added to them?
1) Any non-trivial density of states, especially in semiconductors, with their van Hove singularities.
2) I don’t mean a model like ‘consider an FCC lattice populated by one of 10 types of atoms; here are the transition rates...’, such that the model is made of microstates and you need to do statistics to get probabilities out. I mean a model more like ‘each cigarette smoked increases the annual risk of lung cancer by 0.001%’, so that the output is, naturally, simply a distribution over outcomes (models of this kind include the microstate kind as special cases).
In particular, I’m working under the toy meta-model that models are programs that output a probability distribution over bitstreams; these are their predictions. You measure reality (producing some actual bitstream) and adjust the probability of each of the models according to the probability they gave for that bitstream, using Bayes’ theorem.
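A minimal sketch of that update rule, with three made-up stand-in models (real ones would be arbitrary programs emitting a distribution over bitstreams):

```python
# Each "model" here is reduced to the probability it assigns to an observed
# bitstream. The three below are invented for illustration only.
def always_zeros(bits):            # a trivial "X will happen" style model
    return 1.0 if all(b == 0 for b in bits) else 0.0

def fair_coin(bits):               # maximum-entropy model: every bitstream equally likely
    return 0.5 ** len(bits)

def biased_coin(bits):             # says each bit is a 1 with probability 0.9
    ones = sum(bits)
    return 0.9 ** ones * 0.1 ** (len(bits) - ones)

models = {"always_zeros": always_zeros, "fair_coin": fair_coin, "biased_coin": biased_coin}
prior = {name: 1.0 / len(models) for name in models}   # flat prior, for simplicity

observed = [1, 0, 1, 1, 0, 1, 1, 1]   # the bitstream "reality" produced

# Bayes' theorem: posterior(model) is proportional to prior(model) * P(observed | model).
unnormalized = {name: prior[name] * model(observed) for name, model in models.items()}
total = sum(unnormalized.values())
posterior = {name: p / total for name, p in unnormalized.items()}
print(posterior)   # always_zeros is ruled out; biased_coin wins
```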
3) I may have misused the term. I mean the cost in entropy to produce that precise bitstream. Starting from a random bitstream, how many measurements do you have to use to turn it into, say, 1011011100101 with XOR operations? One for each bit. It doesn’t matter how many bits there are: you need to measure them all.
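To put a number on it (the target string is the one above; everything else is illustrative):

```python
import secrets

target = 0b1011011100101           # the example bitstream above (13 bits)
n = target.bit_length()

start = secrets.randbits(n)        # an unknown random starting bitstream
mask = start ^ target              # the XOR correction that turns start into target
assert start ^ mask == target

# Each bit of the mask depends on the corresponding bit of `start`, so building
# it requires measuring every one of the n bits: one measurement per bit.
print(n, "measurements needed")
```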
When you consider multiple models, you weight them as a function of their information content, preferring shorter ones (a.k.a. Occam’s razor). Normally you reduce the probability by 1/2 for each bit required: P_prior(model) ∝ 2^(-N), where N is the number of bits, and you sum only up to the number of bits of evidence you have. This last clause is a bit of a hack to keep the distribution normalizable (see below).
I drew a comparison of this to temperature, where you have a probability penalty of e^(-E/kT) on each microstate. Any value of T works there because the number of microstates per energy range (the density of states) does not increase exponentially with energy; it usually increases quadratically, or sometimes less (over short energy ranges it is sometimes more).
If you follow the analogy back, the number of bitstreams does increase exponentially as a function of length (it doubles with each bit), so the prior probability penalty per bit must be at least as strong as 1/2 to avoid infinitely long programs being preferred. But you can use a stronger exponential die-off, say 2.01^(-N), and suddenly the distribution is normalizable with no need for the special hack. Whatever particular value you put in there is your e^(1/kT) equivalent in the analogy.
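Spelled out, with N the program length in bits and c the per-bit penalty base (c = 2 is the 1/2 case, c = 2.01 the example above):

```latex
\sum_{N=0}^{\infty} \underbrace{2^{N}}_{\#\ \text{of length-}N\ \text{bitstreams}} \cdot\, c^{-N}
  \;=\; \sum_{N=0}^{\infty} \left(\frac{2}{c}\right)^{N},
  \qquad \text{which converges only for } c > 2,
```

and matching c^(-N) against the Boltzmann factor e^(-E/kT), with N playing the role of E, identifies c with e^(1/kT).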
2) I think this is the distinction you are trying to make between the lattice model and the smoker model: in the lattice model, the equations and parameters are defined, whereas in the smoker model, the equations and parameters have to be deduced. Is that right? If so, my previous posts were referring to the smoker-type model.
Your toy meta-model is consistent with what I was thinking when I used the word “model” in my previous comments.
3) I see what you’re saying. If you add complexity to the model, you want to make sure that its improvement in ability is greater than the amount of complexity added. You want to make sure that the model isn’t just “memorizing” the correct results, and that all model complexity comes with some benefit of generalizability.
I don’t think temperature is the right analogy. What you want is to penalize a model that is too generally applicable. Here is a simple case:
A one-hidden-layer, feed-forward, binary stochastic neural network whose goal is to find binary-vector representations of its binary-vector inputs. It translates its input into an internal representation of length n, then translates that internal representation into a binary-vector output of the same length as its input. The error function is the reconstruction error, measured as the KL-divergence from input to output.
The “complexity” you want is the length of its internal representation, in bits, since each element of the internal representation can retain at most one bit of information, and that bit can reflect the input in an arbitrary way. The information loss is the same as the reconstruction error, in bits, since that describes the probability of the model guessing correctly on a given input stream (assuming each bit is independent). Your criterion translates to “minimize reconstruction error + internal representation size”, and this can be done by repeatedly increasing the size of the internal representation until adding one more element reduces the reconstruction error by less than one bit.
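As a sketch of that stopping rule (the error curve below is a made-up stand-in; in practice reconstruction_error_bits(n) would come from actually training the network with an n-element internal representation and measuring the KL-divergence in bits):

```python
def reconstruction_error_bits(n, total_bits=64.0):
    # Hypothetical diminishing-returns curve standing in for "train the
    # autoencoder with n internal elements and measure its reconstruction error".
    return total_bits * 0.8 ** n

def choose_representation_size(max_n=256):
    """Grow the internal representation until one more element saves less than one bit."""
    n = 0
    err = reconstruction_error_bits(n)
    while n < max_n:
        new_err = reconstruction_error_bits(n + 1)
        if err - new_err < 1.0:      # the next element is not worth its one-bit cost
            break
        n, err = n + 1, new_err
    return n, err

n, err = choose_representation_size()
print(f"chosen size: {n} elements, reconstruction error ~ {err:.2f} bits, total ~ {n + err:.2f} bits")
```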
2) I think this is the distinction you are trying to make between the lattice model and the smoker model: in the lattice model, the equations and parameters are defined, whereas in the smoker model, the equations and parameters have to be deduced. Is that right? If so, my previous posts were referring to the smoker-type model.
Well, the real thing is that (again in the toy metamodel) you consider the complete ensemble of smoker-type models and let them fight it out for good scores when compared to the evidence. I guess you can consider this process to be deduction, sure.
3) (In response to the very end.) That would be at the point where 1 bit of internal representation costs a factor of 1/2 in prior probability. If it were ‘minimize (reconstruction error + 2*representation size)’, then that would be a ‘temperature’ half that, where 1 more bit of internal representation costs a factor of 1/4 in prior probability. Colder thus corresponds to wanting your models smaller at the expense of accuracy, which is sort of backwards from the usual way temperature is used in simulated annealing of MCMC systems.
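In symbols (writing err for the reconstruction error in bits, n for the representation size, and β for the coefficient on n; β is just my shorthand here, not notation from earlier in the thread):

```latex
\text{minimize}\;\bigl(\mathrm{err} + \beta n\bigr)
\;\;\Longleftrightarrow\;\;
\text{maximize}\;\; \underbrace{2^{-\mathrm{err}}}_{\text{likelihood}} \cdot \underbrace{2^{-\beta n}}_{\text{prior}},
```

so each extra bit of internal representation costs a factor of 2^(-β) in prior probability: 1/2 at β = 1, 1/4 at β = 2.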
I see. You’re treating “energy” as the information required to specify a model. Your analogy and your earlier posts make sense now.