Thanks for this post! Going in, I thought I would need to skip over a lot of the details, but I found your explanations clear. It’s also a very intuitive explanation of the maximum entropy principle, which I had seen mentioned many times but never really understood.
But let’s stretch the story to AI-alignment-style concerns: what if model 2 is using some weird ontology? What if the things the company cares about are easy to express in terms of the features f1, but hard to express in terms of the features f2?
Wouldn’t the fact that the final model (either M2 or M′, depending on the case of your theorem) uses M2’s ontology still be an issue? Or is the intuition that the ontology of M1 should correct for the weirdness of the ontology of M2?
(A warning, however: this intuitive story is not perfect. Even if the combined model makes the same predictions as one of the original models about the data, it can still update differently as we learn new facts.)
Do you have a simple example of that?
However, there are two senses in which it’s nontrivial. First, in the case where the new model incorrectly predicts some feature-values μ encoded in the old model, we’ve explicitly constructed a new model which outperforms the old. It’s even a pretty simple, clean model—just another maximum entropy model.
Are you saying that even if M2 doesn’t encode the features of M1, we can build M′ that does? So the non-trivial part is that we can still integrate both sets of features into a better predictive model that remains simple?
Good questions.

Wouldn’t the fact that the final model (either M2 or M′, depending on the case of your theorem) uses M2’s ontology still be an issue? Or is the intuition that the ontology of M1 should correct for the weirdness of the ontology of M2?
Roughly speaking, the final model “has the M1 features in it”, either explicitly or implicitly. There may be extra features in M2, which weren’t in M1, but the M1 ontology is still “a subset” of the M2 ontology.
But the real answer here is that our intuitions about what-ontologies-should-do don’t work all that well without bringing in causal models, which we haven’t explicitly done in this post. The “features” here are an imperfect stand-in for causal structure, which comes a lot closer to capturing the idea of “ontology”. That said, when we rewrite causal models in the maximum entropy formalism, the causal structure does end up encoded in the “features”.
Do you have a simple example of that?
A standard normal distribution has kurtosis (fourth moment) of 3. If we explicitly add a constraint fixing the kurtosis to 3, then the constraint is slack (since it’s already satisfied by the standard normal), and the whole thing is still just a standard normal.
But if we then add an additional constraint, things can diverge. For instance, we can add a constraint to a standard normal that the sixth moment is 10. The sixth moment of a standard normal is actually 15, so this will change the distribution, and the new distribution generally won’t have a kurtosis of 3. On the other hand, if we add our (sixth moment = 10) constraint to the (std normal + kurtosis = 3) constraints, then we definitely will end up with a kurtosis of 3, because that’s still an explicit constraint. (The target value matters a bit here: with variance 1 and kurtosis 3, Cauchy-Schwarz forces the sixth moment to be at least 9, so a target below that would make the combined constraints unsatisfiable.)
Does that make sense?
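To make the example concrete, here is a small numerical sketch (my own illustration, not from the post): put x on a fine grid, then find the maximum-entropy distribution subject to each set of moment constraints by minimizing the usual convex dual of the maxent problem with scipy. The grid range, the sixth-moment target of 10, and helper names like maxent and moment are all just choices made for this sketch.

```python
# Sketch: discretize x onto a grid and find the maximum-entropy distribution
# subject to moment constraints E[x^k] = mu_k, by minimizing the convex dual
#   A(lam) - lam . mu,   where A(lam) is the log-partition function.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

x = np.linspace(-5, 5, 2001)  # fine grid; tails beyond |x| = 5 are negligible here

def maxent(constraints):
    """Maxent distribution on the grid; constraints is a list of (power, target) pairs."""
    feats = np.stack([x ** k for k, _ in constraints])   # shape (num_constraints, num_points)
    mu = np.array([t for _, t in constraints], dtype=float)
    scale = np.abs(feats).max(axis=1)                    # rescale features so the
    feats, mu = feats / scale[:, None], mu / scale       # optimization is well-conditioned

    def dual(lam):
        logits = lam @ feats
        log_z = logsumexp(logits)                        # log-partition function
        p = np.exp(logits - log_z)                       # candidate maxent distribution
        return log_z - lam @ mu, feats @ p - mu          # dual value and its gradient

    res = minimize(dual, np.zeros(len(mu)), jac=True, method="BFGS")
    logits = res.x @ feats
    return np.exp(logits - logsumexp(logits))

def moment(p, k):
    return float(p @ x ** k)

# With E[x^2] fixed at 1, the fourth moment is the kurtosis.
cases = [
    ("E[x^2]=1                     ", [(2, 1.0)]),
    ("E[x^2]=1, E[x^4]=3 (slack)   ", [(2, 1.0), (4, 3.0)]),
    ("E[x^2]=1, E[x^6]=10          ", [(2, 1.0), (6, 10.0)]),
    ("E[x^2]=1, E[x^4]=3, E[x^6]=10", [(2, 1.0), (4, 3.0), (6, 10.0)]),
]
for name, cons in cases:
    p = maxent(cons)
    print(f"{name} -> E[x^4] = {moment(p, 4):.3f}, E[x^6] = {moment(p, 6):.2f}")
```

If this sketch behaves as intended, the first two cases come out essentially identical (the kurtosis constraint is slack), the third case’s fourth moment drifts away from 3, and the fourth case pins the fourth moment back to 3 while holding the sixth moment at 10.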
Are you saying that even if M2 doesn’t encode the features of M1, we can build M′ that does? So the non-trivial part is that we can still integrate both sets of features into a better predictive model that remains simple?
Correct.
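(Spelling that out, on my reading of the post’s construction: M′ is just the maximum entropy distribution subject to both sets of feature constraints at once, i.e. M′(x) ∝ exp(Σ_i λ1_i f1_i(x) + Σ_j λ2_j f2_j(x)), with the multipliers chosen so that M′ reproduces both the feature-values μ1 encoded by M1 and the feature-values μ2 encoded by M2. The fourth case in the numerical sketch above is a toy instance of the same idea: the extra sixth-moment constraint plays the role of M2’s new feature, while the explicit kurtosis constraint keeps M1’s feature-value pinned in place.)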