Thanks for this post! Going in, I thought I would need to skip over a lot of the details, but I found your explanations clear. It’s also a very intuitive explanation of the maximum entropy principle, which I had seen mentioned many times but never really understood.
But let’s stretch the story to AI-alignment-style concerns: what if model 2 is using some weird ontology? What if the things the company cares about are easy to express in terms of the features f1, but hard to express in terms of the features f2?
Wouldn’t the fact that the final model (either M2 or M′, depending on the case of your theorem) uses M2’s ontology still be an issue? Or is the intuition that the ontology of M1 should correct for the weirdness of the ontology of M2?
(A warning, however: this intuitive story is not perfect. Even if the combined model makes the same predictions as one of the original models about the data, it can still update differently as we learn new facts.)
Do you have a simple example of that?
However, there are two senses in which it’s nontrivial. First, in the case where the new model incorrectly predicts some feature-values μ encoded in the old model, we’ve explicitly constructed a new model which outperforms the old. It’s even a pretty simple, clean model—just another maximum entropy model.
Are you saying that even if M2 doesn’t encode the features of M1, we can build M′ that does? So the non-trivial part is that we can still integrate both sets of features into a better predictive model that remains simple?
Good questions.

Wouldn’t the fact that the final model (either M2 or M′, depending on the case of your theorem) uses M2’s ontology still be an issue? Or is the intuition that the ontology of M1 should correct for the weirdness of the ontology of M2?
Roughly speaking, the final model “has the M1 features in it”, either explicitly or implicitly. There may be extra features in M2, which weren’t in M1, but the M1 ontology is still “a subset” of the M2 ontology.
But the real answer here is that our intuitions about what-ontologies-should-do don’t work all that well without bringing in causal models, which we haven’t explicitly done in this post. The “features” here are an imperfect stand-in for causal structure, which comes a lot closer to capturing the idea of “ontology”. That said, when we rewrite causal models in the maximum entropy formalism, the causal structure does end up encoded in the “features”.
Do you have a simple example of that?
A standard normal distribution has kurtosis (fourth moment) of 3. If we explicitly add a constraint fixing the kurtosis to 3, then the constraint is slack (since it’s already satisfied by the standard normal), and the whole thing is still just a standard normal.
But if we then add an additional constraint, things can diverge. For instance, we can add a constraint to a standard normal that the sixth moment is 10. The sixth moment of a standard normal is actually 15, so this will change the distribution, and the new distribution generally won’t have a kurtosis of 3. On the other hand, if we add our (sixth moment = 10) constraint to the (std normal + kurtosis = 3) constraints, then we definitely will end up with a kurtosis of 3, because that’s still an explicit constraint. (The target value matters a bit here: with variance 1 and kurtosis 3, Cauchy-Schwarz forces the sixth moment to be at least 9, so a target below that would make the combined constraints unsatisfiable.)
Does that make sense?
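To make the example concrete, here is a small numerical sketch (my own illustration, not from the post): put x on a fine grid, then find the maximum-entropy distribution subject to each set of moment constraints by minimizing the usual convex dual of the maxent problem with scipy. The grid range, the sixth-moment target of 10, and helper names like maxent and moment are all just choices made for this sketch.

```python
# Sketch: discretize x onto a grid and find the maximum-entropy distribution
# subject to moment constraints E[x^k] = mu_k, by minimizing the convex dual
#   A(lam) - lam . mu,   where A(lam) is the log-partition function.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

x = np.linspace(-5, 5, 2001)  # fine grid; tails beyond |x| = 5 are negligible here

def maxent(constraints):
    """Maxent distribution on the grid; constraints is a list of (power, target) pairs."""
    feats = np.stack([x ** k for k, _ in constraints])   # shape (num_constraints, num_points)
    mu = np.array([t for _, t in constraints], dtype=float)
    scale = np.abs(feats).max(axis=1)                    # rescale features so the
    feats, mu = feats / scale[:, None], mu / scale       # optimization is well-conditioned

    def dual(lam):
        logits = lam @ feats
        log_z = logsumexp(logits)                        # log-partition function
        p = np.exp(logits - log_z)                       # candidate maxent distribution
        return log_z - lam @ mu, feats @ p - mu          # dual value and its gradient

    res = minimize(dual, np.zeros(len(mu)), jac=True, method="BFGS")
    logits = res.x @ feats
    return np.exp(logits - logsumexp(logits))

def moment(p, k):
    return float(p @ x ** k)

# With E[x^2] fixed at 1, the fourth moment is the kurtosis.
cases = [
    ("E[x^2]=1                     ", [(2, 1.0)]),
    ("E[x^2]=1, E[x^4]=3 (slack)   ", [(2, 1.0), (4, 3.0)]),
    ("E[x^2]=1, E[x^6]=10          ", [(2, 1.0), (6, 10.0)]),
    ("E[x^2]=1, E[x^4]=3, E[x^6]=10", [(2, 1.0), (4, 3.0), (6, 10.0)]),
]
for name, cons in cases:
    p = maxent(cons)
    print(f"{name} -> E[x^4] = {moment(p, 4):.3f}, E[x^6] = {moment(p, 6):.2f}")
```

If this sketch behaves as intended, the first two cases come out essentially identical (the kurtosis constraint is slack), the third case’s fourth moment drifts away from 3, and the fourth case pins the fourth moment back to 3 while holding the sixth moment at 10.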
Are you saying that even if M2 doesn’t encode the features of M1, we can build M′ that does? So the non-trivial part is that we can still integrate both sets of features into a better predictive model that remains simple?
Correct.
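(Spelling that out, on my reading of the post’s construction: M′ is just the maximum entropy distribution subject to both sets of feature constraints at once, i.e. M′(x) ∝ exp(Σ_i λ1_i f1_i(x) + Σ_j λ2_j f2_j(x)), with the multipliers chosen so that M′ reproduces both the feature-values μ1 encoded by M1 and the feature-values μ2 encoded by M2. The fourth case in the numerical sketch above is a toy instance of the same idea: the extra sixth-moment constraint plays the role of M2’s new feature, while the explicit kurtosis constraint keeps M1’s feature-value pinned in place.)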