Thanks for the answer Liam! I especially liked the further context on the connection between Bayesian posteriors and SGD. Below are a few more comments on some of your answers:
The partition function is equal to the model evidence $Z_n = p(D_n)$, yep. It isn’t equal to $p((Y_i) \mid (X_i))$ (I assume $i$ is fixed here?), but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),

$$p(D_n) = \int_W \varphi(w) \prod_{i=1}^n p(y_i, x_i \mid w) \, dw,$$

and then under this supervised learning setup where we know $q(x_i)$, we have $p(y_i, x_i \mid w) = p(y_i \mid x_i, w)\, q(x_i)$. Also note that this does “factor over $i$” (if I’m interpreting you correctly), since the data are independent and identically distributed.
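Spelling out that last step (just substituting the factorisation into the integral, and using that $q(x_i)$ does not depend on $w$):

$$p(D_n) = \int_W \varphi(w) \prod_{i=1}^n p(y_i \mid x_i, w)\, q(x_i) \, dw = \Big( \prod_{i=1}^n q(x_i) \Big) \int_W \varphi(w) \prod_{i=1}^n p(y_i \mid x_i, w) \, dw,$$

so the $q(x_i)$ terms come out of the integral as a constant factor.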
I think I still disagree. I think everything in these formulas needs to be conditioned on the $X$-part of the dataset. In particular, I think the notation $p(D_n)$ is slightly misleading, but maybe I’m missing something here.
I’ll walk you through my reasoning: when I write $(X_i)$ or $(Y_i)$, I mean the whole vectors, e.g., $(X_i)_{i=1,\dots,n}$. Then I think the posterior computation works as follows:

$$p(w \mid D_n) = p(w \mid (Y_i), (X_i)) = \frac{p((Y_i) \mid (X_i), w) \cdot p(w \mid (X_i))}{p((Y_i) \mid (X_i))}.$$

That is just Bayes’ rule, conditioned on $(X_i)$ in every term. Then $p(w \mid (X_i)) = \varphi(w)$, because from $X$ alone you don’t get any new information about the conditional $q(Y \mid X)$ (a more formal way to see this is to write down the Bayesian network of the model and to see that $w$ and $X_i$ are d-separated). Also, conditioned on $w$, $p$ is independent over data points, and so we obtain
$$p(w \mid D_n) = \frac{1}{p((Y_i) \mid (X_i))} \cdot e^{-n L_n(w)} \cdot \varphi(w).$$
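Here I’m writing $L_n$ for the empirical negative log-likelihood, which I’m assuming matches the $L_n$ in your equations:

$$L_n(w) := -\frac{1}{n} \sum_{i=1}^n \log p(Y_i \mid X_i, w), \qquad \text{so that} \qquad \prod_{i=1}^n p(Y_i \mid X_i, w) = e^{-n L_n(w)}.$$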
So, comparing with your equations, we must have $Z_n = p((Y_i) \mid (X_i))$. Do you think this is correct?
Btw., I still don’t think this “factors over $i$”. I think that

$$Z_n \neq \prod_{i=1}^n p(Y_i \mid X_i).$$
The reason is that earlier data points should inform the parameter $w$, which in turn should influence the predictions for later data points. I think the independence assumption only holds for the true distribution and for the model conditioned on $w$.
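To make this concrete, here is the $n = 2$ case (just the chain rule, together with the d-separation argument from above):

$$p(Y_1, Y_2 \mid X_1, X_2) = p(Y_1 \mid X_1) \cdot p(Y_2 \mid Y_1, X_1, X_2), \qquad p(Y_2 \mid Y_1, X_1, X_2) = \int_W p(Y_2 \mid X_2, w)\, p(w \mid Y_1, X_1)\, dw.$$

The second factor is a posterior predictive, averaging over $p(w \mid Y_1, X_1)$ rather than over the prior $\varphi$, so it is not $p(Y_2 \mid X_2)$ in general.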
Right, that makes sense, thank you! (I think you missed a factor of $n/2$, but that doesn’t change the conclusion.)
Thanks also for the corrected volume formula, it makes sense now :)