Wouldn’t the fact that the final model (either M2 or M′, depending on the case of your theorem) uses M2’s ontology still be an issue? Or is the intuition that the ontology of M1 should correct for the weirdness of the ontology of M2?
Roughly speaking, the final model “has the M1 features in it”, either explicitly or implicitly. There may be extra features in M2 which weren’t in M1, but the M1 ontology is still “a subset” of the M2 ontology.
But the real answer here is that our intuitions about what-ontologies-should-do don’t work all that well without bringing in causal models, which we haven’t explicitly done in this post. The “features” here are an imperfect stand-in for causal structure, which comes a lot closer to capturing the idea of “ontology”. That said, when we rewrite causal models in the maximum entropy formalism, the causal structure does end up encoded in the “features”.
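For concreteness, here is the generic maximum entropy form being referred to (standard maxent, not notation specific to this post): given feature functions $f_i$ with constrained expectations, the maxent distribution is

$$P(x) \propto \exp\Big(\sum_i \lambda_i f_i(x)\Big),$$

where each multiplier $\lambda_i$ is chosen so that $E[f_i(X)]$ matches its constrained value. So “which features appear in the exponent” is what plays the role of the ontology here.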
Do you have a simple example of that?
A standard normal distribution has kurtosis (fourth moment) of 3. If we explicitly add a constraint fixing the kurtosis to 3, then the constraint is slack (it’s already satisfied by the standard normal), and the whole thing is still just a standard normal.
But if we then add an additional constraint, the two distributions can diverge. For instance, take the standard normal and constrain the sixth moment to be 2. The sixth moment of a standard normal is actually 15, not 2, so this constraint binds: it changes the distribution, and the new distribution probably won’t have kurtosis of 3 anymore. On the other hand, if we add our (sixth moment = 2) constraint to the (std normal + kurtosis = 3) distribution, then we definitely will end up with a kurtosis of 3, because that’s still an explicit constraint.
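A quick numerical sanity check of those moments (a minimal sketch using scipy; the same values follow from the Gaussian even-moment formula E[X^(2k)] = (2k-1)!!):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Raw moments of the standard normal N(0, 1), by numerical integration.
pdf = stats.norm.pdf

def raw_moment(n: int) -> float:
    """E[X^n] for X ~ N(0, 1)."""
    value, _ = quad(lambda x: x**n * pdf(x), -np.inf, np.inf)
    return value

print(raw_moment(4))  # ~3.0: a "kurtosis = 3" constraint is slack for the standard normal
print(raw_moment(6))  # ~15.0: so a "sixth moment = 2" constraint genuinely binds
```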
Does that make sense?
Are you saying that even if M2 doesn’t encode the features of M1, we can build an M′ that does? So the non-trivial part is that we can integrate both sets of features into a better predictive model that remains simple?
Good questions.
Correct on both counts.