Note that, assuming the test/validation distribution is an empirical dataset (i.e. a finite mixture of Dirac deltas) and the original graph G is deterministic, the $D_{\mathrm{KL}}$ between the pushforward distributions on the outputs of the computational graph will typically be infinite. In that setting you would need to use a Wasserstein distance instead, or to “thicken” the distributions by adding absolutely continuous noise to the input and/or output.
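(For concreteness, here is a toy sketch of the failure and the two workarounds; `G`, `I`, and the data in it are made-up stand-ins, not anyone's actual setup:)

```python
# A minimal numerical sketch: `G` stands in for the original deterministic
# graph and `I` for its ablated counterpart, both mapping inputs to a scalar.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
xs = rng.normal(size=1000)               # finite test/validation set
G = lambda x: np.tanh(x)                 # original graph (deterministic)
I = lambda x: np.tanh(0.9 * x)           # ablated graph (deterministic)

# Both pushforwards are finite mixtures of Dirac deltas with (generically)
# disjoint supports, so the KL divergence between them is infinite. The
# Wasserstein distance between the two point clouds is still finite:
w = wasserstein_distance(G(xs), I(xs))

# "Thickening": add absolutely continuous (Gaussian) noise to both outputs,
# then estimate KL between the smoothed densities (crude Gaussian fit here,
# purely for illustration).
sigma = 0.05
g = G(xs) + rng.normal(scale=sigma, size=xs.shape)
i = I(xs) + rng.normal(scale=sigma, size=xs.shape)
mu_g, var_g = g.mean(), g.var()
mu_i, var_i = i.mean(), i.var()
kl_thickened = 0.5 * (np.log(var_g / var_i) + (var_i + (mu_i - mu_g) ** 2) / var_g - 1)
print(f"Wasserstein: {w:.4f}  thickened KL(I||G), Gaussian approx.: {kl_thickened:.4f}")
```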
Or maybe you meant the case where the output is a softmax layer and is interpreted as a probability distribution, in which case $\mathbb{E}_x\, D_{\mathrm{KL}}(I(x) \,\|\, G(x))$ does seem reasonable. That looks like a special case of the following sentence, where you suggest using the original loss function but substituting the unablated model for the supervision targets; that also seems like a good summary statistic to look at.
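(If it helps, a toy sketch of that statistic under the softmax reading; `logits_G` and `logits_I` are hypothetical placeholders for the original and ablated models' outputs on the same inputs:)

```python
# A rough sketch of the E_x D_KL(I(x) || G(x)) summary statistic under the
# softmax interpretation, on made-up logits.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits_G = rng.normal(size=(1000, 10))                    # original model G(x)
logits_I = logits_G + 0.1 * rng.normal(size=(1000, 10))   # ablated model I(x)

p_G, p_I = softmax(logits_G), softmax(logits_I)

# Per-input KL of the ablated distribution from the original one, averaged
# over the empirical test distribution:
kl_per_x = (p_I * (np.log(p_I) - np.log(p_G))).sum(axis=-1)
print("E_x D_KL(I(x) || G(x)) =", kl_per_x.mean())

# The closely related "original loss, unablated targets" statistic:
# cross-entropy of the ablated model's prediction against G(x) as a soft label.
ce_soft_targets = -(p_G * np.log(p_I)).sum(axis=-1)
print("E_x CE(target=G(x), pred=I(x)) =", ce_soft_targets.mean())
```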
The second paragraph (the softmax interpretation) is what I meant, thanks.