Wouldn’t the fact that the final model (either M2 or M′, depending on the case of your theorem) uses M2’s ontology still be an issue? Or is the intuition that the ontology of M1 should correct for the weirdness of the ontology of M2?
Roughly speaking, the final model “has the M1 features in it”, either explicitly or implicitly. There may be extra features in M2 which weren’t in M1, but the M1 ontology is still “a subset” of the M2 ontology.
But the real answer here is that our intuitions about what-ontologies-should-do don’t work all that well without bringing in causal models, which we haven’t explicitly done in this post. The “features” here are an imperfect stand-in for causal structure, which comes a lot closer to capturing the idea of “ontology”. That said, when we rewrite causal models in the maximum entropy formalism, the causal structure does end up encoded in the “features”.
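For concreteness, here is the generic maximum entropy form being referred to (standard maxent, not notation specific to this post): given feature functions $f_i$ with constrained expectations, the maxent distribution is

$$P(x) \propto \exp\Big(\sum_i \lambda_i f_i(x)\Big),$$

where each multiplier $\lambda_i$ is chosen so that $E[f_i(X)]$ matches its constrained value. So “which features appear in the exponent” is what plays the role of the ontology here.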
Do you have a simple example of that?
A standard normal distribution has kurtosis (fourth moment) of 3. If we explicitly add a constraint fixing the kurtosis to 3, then the constraint is slack (it’s already satisfied by the standard normal), and the whole thing is still just a standard normal.
But if we then add an additional constraint, the two distributions can diverge. For instance, take the standard normal and constrain the sixth moment to be 2. The sixth moment of a standard normal is actually 15, not 2, so this constraint binds: it changes the distribution, and the new distribution probably won’t have kurtosis of 3 anymore. On the other hand, if we add our (sixth moment = 2) constraint to the (std normal + kurtosis = 3) distribution, then we definitely will end up with a kurtosis of 3, because that’s still an explicit constraint.
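A quick numerical sanity check of those moments (a minimal sketch using scipy; the same values follow from the Gaussian even-moment formula E[X^(2k)] = (2k-1)!!):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Raw moments of the standard normal N(0, 1), by numerical integration.
pdf = stats.norm.pdf

def raw_moment(n: int) -> float:
    """E[X^n] for X ~ N(0, 1)."""
    value, _ = quad(lambda x: x**n * pdf(x), -np.inf, np.inf)
    return value

print(raw_moment(4))  # ~3.0: a "kurtosis = 3" constraint is slack for the standard normal
print(raw_moment(6))  # ~15.0: so a "sixth moment = 2" constraint genuinely binds
```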
Does that make sense?
Are you saying that even if M2 doesn’t encode the features of M1, we can build an M′ that does? So the non-trivial part is that we can integrate both sets of features into a better predictive model that remains simple?
Good questions.
Correct on both counts.