Thanks for the answer Liam! I especially liked the further context on the connection between Bayesian posteriors and SGD. Below are a few more comments on some of your answers:
The partition function is equal to the model evidence $Z_n = p(D_n)$, yep. It isn’t equal to $p((Y_i) \mid (X_i))$ (I assume $i$ is fixed here?), but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),

$$p(D_n) = \int_W \varphi(w) \prod_{i=1}^n p(y_i, x_i \mid w) \, dw,$$

and then under this supervised learning setup where we know $q(x_i)$, we have $p(y_i, x_i \mid w) = p(y_i \mid x_i, w)\, q(x_i)$. Also note that this does “factor over $i$” (if I’m interpreting you correctly), since the data are independent and identically distributed.
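Spelling out that last step (just substituting the factorisation into the integral, and using that $q(x_i)$ does not depend on $w$):

$$p(D_n) = \int_W \varphi(w) \prod_{i=1}^n p(y_i \mid x_i, w)\, q(x_i) \, dw = \Big( \prod_{i=1}^n q(x_i) \Big) \int_W \varphi(w) \prod_{i=1}^n p(y_i \mid x_i, w) \, dw,$$

so the $q(x_i)$ terms come out of the integral as a constant factor.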
I think I still disagree. I think everything in these formulas needs to be conditioned on the $X$-part of the dataset. In particular, I think the notation $p(D_n)$ is slightly misleading, but maybe I’m missing something here.
I’ll walk you through my reasoning: when I write $(X_i)$ or $(Y_i)$, I mean the whole vectors, e.g., $(X_i)_{i=1,\dots,n}$. Then I think the posterior computation works as follows:

$$p(w \mid D_n) = p(w \mid (Y_i), (X_i)) = \frac{p((Y_i) \mid (X_i), w) \cdot p(w \mid (X_i))}{p((Y_i) \mid (X_i))}.$$

That is just Bayes’ rule, conditioned on $(X_i)$ in every term. Then $p(w \mid (X_i)) = \varphi(w)$, because from $X$ alone you don’t get any new information about the conditional $q(Y \mid X)$ (a more formal way to see this is to write down the Bayesian network of the model and to see that $w$ and $X_i$ are d-separated). Also, conditioned on $w$, $p$ is independent over data points, and so we obtain
$$p(w \mid D_n) = \frac{1}{p((Y_i) \mid (X_i))} \cdot e^{-n L_n(w)} \cdot \varphi(w).$$
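Here I’m writing $L_n$ for the empirical negative log-likelihood, which I’m assuming matches the $L_n$ in your equations:

$$L_n(w) := -\frac{1}{n} \sum_{i=1}^n \log p(Y_i \mid X_i, w), \qquad \text{so that} \qquad \prod_{i=1}^n p(Y_i \mid X_i, w) = e^{-n L_n(w)}.$$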
So, comparing with your equations, we must have $Z_n = p((Y_i) \mid (X_i))$. Do you think this is correct?
Btw., I still don’t think this “factors over $i$”. I think that

$$Z_n \neq \prod_{i=1}^n p(Y_i \mid X_i).$$
The reason is that earlier data points should inform the parameter $w$, which in turn should influence the predictions for later data points. I think the independence assumption only holds for the true distribution and for the model conditioned on $w$.
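To make this concrete, here is the $n = 2$ case (just the chain rule, together with the d-separation argument from above):

$$p(Y_1, Y_2 \mid X_1, X_2) = p(Y_1 \mid X_1) \cdot p(Y_2 \mid Y_1, X_1, X_2), \qquad p(Y_2 \mid Y_1, X_1, X_2) = \int_W p(Y_2 \mid X_2, w)\, p(w \mid Y_1, X_1)\, dw.$$

The second factor is a posterior predictive, averaging over $p(w \mid Y_1, X_1)$ rather than over the prior $\varphi$, so it is not $p(Y_2 \mid X_2)$ in general.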
Right, that makes sense, thank you! (I think you missed a factor of $n/2$, but that doesn’t change the conclusion.)
Thanks also for the corrected volume formula, it makes sense now :)