1) Any non-trivial density of states, especially that of semiconductors, because of the van Hove singularities.
2) I don’t mean a model like ‘consider an FCC lattice populated by one of 10 types of atoms; here are the transition rates...’, where the model is made of microstates and you need to do statistics to get probabilities out. I mean a model more like ‘Each cigarette smoked increases the annual risk of lung cancer by 0.001%’, where the output is already, naturally, a distribution over outcomes (models of this second kind include the first as a special case).
In particular, I’m working under the toy meta-model that models are programs that output a probability distribution over bitstreams; these are their predictions. You measure reality (producing some actual bitstream) and adjust the probability of each of the models according to the probability they gave for that bitstream, using Bayes’ theorem.
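To make that concrete, here is a minimal sketch of the updating step, with made-up Bernoulli ‘models’ standing in for real ones (the model names, probabilities, and observed bits are purely illustrative):

```python
# Minimal sketch of the toy meta-model's updating step, using made-up Bernoulli
# "models" as stand-ins: each model is just a function assigning a probability
# to any observed bitstream, and Bayes' theorem reweights the models afterwards.
import numpy as np

def bernoulli_model(p):
    """A toy model: 'every bit is 1 with probability p, independently'."""
    return lambda bits: np.prod([p if b else 1 - p for b in bits])

models = {"p=0.2": bernoulli_model(0.2),
          "p=0.5": bernoulli_model(0.5),
          "p=0.8": bernoulli_model(0.8)}
prior = {name: 1 / len(models) for name in models}   # start with a uniform prior

observed = [1, 0, 1, 1, 0, 1, 1, 1]                  # "measure reality"

# Bayes' theorem: posterior ∝ prior × P(observed bitstream | model)
unnormalized = {name: prior[name] * model(observed) for name, model in models.items()}
evidence = sum(unnormalized.values())
posterior = {name: weight / evidence for name, weight in unnormalized.items()}
print(posterior)   # the p=0.8 model ends up with most of the weight
```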
3) I may have misused the term. I mean the cost in entropy to produce that precise bitstream. Starting from a random bitstream, how many measurements do you have to use to turn it into, say, 1011011100101 with xor operations? One for each bit. It doesn’t matter how many bits there are; you need to measure them all.
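A toy illustration of the counting argument (the target is the bitstream above; everything else is arbitrary):

```python
# Turning a random bitstream into the specific target 1011011100101 with xor
# operations requires measuring every bit, so the cost scales with the length
# of the target: one measurement per bit.
import random

target = [1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]    # 1011011100101
stream = [random.randint(0, 1) for _ in target]      # start from a random bitstream

measurements = 0
for i, want in enumerate(target):
    measured = stream[i]           # one measurement for this bit
    measurements += 1
    stream[i] ^= measured ^ want   # xor in the correction for this bit

assert stream == target
assert measurements == len(target)   # one per bit, no matter how many bits
```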
When you consider multiple models, you weight them as a function of their information content, preferring shorter ones, a.k.a. Occam’s razor. Normally, you reduce the probability by 1⁄2 for each bit required: P_prior(model) ∝ 2^(-N), and you sum only up to the number of bits of evidence you have. This last clause is a bit of a hack to keep the distribution normalizable (see below).
I drew a comparison of this to temperature, where you have a probability penalty of e^(-E/kT) on each microstate. Any positive temperature works there because the number of microstates per energy range (the density of states) does not increase exponentially with energy; it usually grows roughly quadratically, sometimes more slowly (and over short energy ranges, sometimes faster).
If you follow the analogy back, the number of bitstreams does increase exponentially as a function of length (it doubles with each bit), so the prior probability penalty per bit must be at least as strong as 1⁄2 to avoid infinitely long programs being preferred. But you can use a stronger exponential die-off, say 2.01^(-N), and suddenly the distribution is normalizable with no need for the special hack. Whatever particular base you put in there plays the role of e^(1/kT) in the analogy.
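A quick numerical check of that normalization point (the cutoff length is arbitrary):

```python
# There are 2**N bitstreams of length N, so with a per-bit penalty of 1/2 each
# length contributes total weight 1 and the sum over lengths diverges; any
# stronger penalty, e.g. 1/2.01 per bit, makes the sum converge.
def total_prior_weight(base, max_len=10_000):
    # sum over lengths N of (number of length-N bitstreams) * base**(-N),
    # written as (2 / base)**N to stay in floating point
    return sum((2 / base) ** N for N in range(1, max_len + 1))

print(total_prior_weight(2.0))    # ~10000: grows without bound as max_len grows
print(total_prior_weight(2.01))   # ~200: converges, no cutoff hack needed
```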
2) I think this is the distinction you are trying to make between the lattice model and the smoker model: in the lattice model, the equations and parameters are defined, whereas in the smoker model, the equations and parameters have to be deduced. Is that right? If so, my previous posts were referring to the smoker-type model.
Your toy meta-model is consistent with what I was thinking when I used the word “model” in my previous comments.
3) I see what you’re saying. If you add complexity to the model, you want to make sure that its improvement in ability is greater than the amount of complexity added: the model shouldn’t just be “memorizing” the correct results, and all added complexity should come with some benefit in generalizability.
I don’t think temperature is the right analogy. What you want is to penalize a model that is too generally applicable. Here is a simple case:
A one-hidden-layer feed-forward binary stochastic neural network whose goal is to find binary-vector representations of its binary-vector inputs. It translates its input into an internal representation of length n, then translates that internal representation into a binary-vector output of the same length as its input. The error function is the reconstruction error, measured as the KL divergence from input to output.
The “complexity” you want is the length of its internal representation, in bits, since each element of the internal representation can retain at most one bit of information, and that bit can reflect the input in an arbitrary way. The information loss is the same as the reconstruction error, in bits, since that describes the probability of the model guessing correctly on a given input stream (assuming each bit is independent). Your criterion translates to “minimize reconstruction error + internal representation size”, and this can be done by repeatedly increasing the size of the internal representation until adding one more element reduces the reconstruction error by less than one bit.
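Here is roughly what I mean, as a sketch, with two simplifications that are not part of the description above: the binary stochastic hidden layer is relaxed to deterministic sigmoid units, and training is plain batch gradient descent. The growth loop at the bottom is the part that implements the criterion; the data and all parameter values are made up.

```python
# Sketch of "minimize reconstruction error + internal representation size":
# grow the hidden layer until one more element buys less than one bit.
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_error_bits(X, n_hidden, lr=0.5, steps=2000):
    """Train a one-hidden-layer sigmoid autoencoder and return the mean
    reconstruction error per example in bits (cross-entropy / ln 2)."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(steps):
        H = sigmoid(X @ W1 + b1)    # internal representation (relaxed to [0, 1])
        Y = sigmoid(H @ W2 + b2)    # reconstruction of the input
        dZ2 = (Y - X) / n           # cross-entropy gradient w.r.t. output pre-activation
        dZ1 = (dZ2 @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dZ2; b2 -= lr * dZ2.sum(axis=0)
        W1 -= lr * X.T @ dZ1; b1 -= lr * dZ1.sum(axis=0)
    Y = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    cross_entropy = -(X * np.log(Y + 1e-12) + (1 - X) * np.log(1 - Y + 1e-12)).sum(axis=1).mean()
    return cross_entropy / np.log(2)

# Toy data: 8-bit inputs that really carry only ~3 bits of structure.
patterns = (rng.integers(0, 2, (8, 3)) @ rng.integers(0, 2, (3, 8))) % 2
X = patterns[rng.integers(0, 8, 200)].astype(float)

# Repeatedly increase the representation size until the marginal element
# reduces reconstruction error by less than one bit.
n_hidden, best = 1, reconstruction_error_bits(X, 1)
while True:
    err = reconstruction_error_bits(X, n_hidden + 1)
    if best - err < 1.0:
        break
    n_hidden, best = n_hidden + 1, err
print(f"chosen representation size: {n_hidden}, error ≈ {best:.2f} bits/example")
```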
2) I think this is the distinction you are trying to make between the lattice model and the smoker model: in the lattice model, the equations and parameters are defined, whereas in the smoker model, the equations and parameters have to be deduced. Is that right? If so, my previous posts were referring to the smoker-type model.
Well, the real thing is that (again in the toy meta-model) you consider the complete ensemble of smoker-type models and let them fight it out for good scores when compared against the evidence. I guess you can consider this process to be deduction, sure.
3) (In response to the very end.) That would be the point where 1 bit of internal representation costs a factor of 1⁄2 in prior probability. If it were ‘minimize (reconstruction error + 2 × representation size)’, that would be a ‘temperature’ half as large, where 1 more bit of internal representation costs a factor of 1⁄4 in prior probability. Colder thus corresponds to wanting your models smaller at the expense of accuracy, which is sort of backwards from the usual way temperature is used in simulated annealing of MCMC systems.
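Spelling out the correspondence (β here is just my label for the weight on representation size; E and S are reconstruction error and representation size, both in bits):

```latex
% beta = weight on representation size; E = reconstruction error (bits);
% S = internal representation size (bits).
\[
\min \; E + \beta S
\quad\Longleftrightarrow\quad
\max \; 2^{-E}\,\bigl(2^{-\beta}\bigr)^{S},
\]
so each bit of internal representation costs a factor of $2^{-\beta}$ of prior
probability: $1/2$ for $\beta = 1$, $1/4$ for $\beta = 2$. Matching the per-bit
penalty to $e^{-1/kT}$ gives $kT = 1/(\beta \ln 2)$, so doubling $\beta$ halves
the temperature.
```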
I see. You’re treating “energy” as the information required to specify a model. Your analogy and your earlier posts make sense now.