Lucius Bushnaq comments on Practical Pitfalls of Causal Scrubbing

Lucius Bushnaq 27 Mar 2023 13:45 UTC
1 point
Your suggestion of using $D_{KL}$ seems a useful improvement compared to most metrics. It’s, however, still possible that cancellation could occur. Cancellation is mostly due to aggregating over a metric (e.g., the mean) and less due to the specific metric used (although I could imagine that some metrics like $D_{KL}$ could allow for less ambiguity).
It’s not about $D_{KL}$ vs. some other loss function. It’s about using a one-dimensional summary of a high-dimensional comparison, instead of a one-dimensional comparison of one-dimensional summaries. There are many ways for two neural networks to both diverge from some training labels $y$ by an average loss $l$ while spitting out very different outputs. There are tautologically no ways for two neural networks to have different output behaviour without having non-zero divergence in label assignment for at least some data points. Thus, it seems that you would want a metric that aggregates the divergence of the two networks’ outputs from each other, not a metric that compares their separate aggregated divergences from some unrelated data labels and so throws away most of the information.
A low-dimensional summary of a high-dimensional comparison between the networks seems fine(ish). A low-dimensional comparison between the networks based on the summaries of their separate comparisons to a third distribution throws away a lot of the relevant information.
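To make the distinction concrete, here is a minimal toy sketch in plain Python. Everything in it is made up for illustration: the two "networks" are just hand-written tables of class probabilities on four data points, and the names `net_a`, `net_b`, `mean_loss`, and `mean_kl` are hypothetical. It contrasts comparing the two average losses against the labels with averaging the per-datapoint $D_{KL}$ between the two outputs.

```python
import math

def cross_entropy(probs, label):
    # Loss of one predicted distribution against one hard label.
    return -math.log(probs[label])

def kl(p, q):
    # D_KL(p || q) for two discrete distributions over the same classes.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_loss(net, labels):
    # Summary of one network's separate comparison to the labels.
    return math.fsum(cross_entropy(p, y) for p, y in zip(net, labels)) / len(labels)

def mean_kl(net_p, net_q):
    # Summary of the direct, per-datapoint comparison between two networks.
    return math.fsum(kl(p, q) for p, q in zip(net_p, net_q)) / len(net_p)

# Toy data (assumed for illustration): four points, two classes.
labels = [0, 0, 1, 1]
# net_a is confident on the first two points and weak on the last two.
net_a = [[0.9, 0.1], [0.9, 0.1], [0.6, 0.4], [0.6, 0.4]]
# net_b is weak on the first two points and confident on the last two.
net_b = [[0.4, 0.6], [0.4, 0.6], [0.1, 0.9], [0.1, 0.9]]

loss_a = mean_loss(net_a, labels)
loss_b = mean_loss(net_b, labels)
loss_gap = abs(loss_a - loss_b)      # comparison of two summaries
divergence = mean_kl(net_a, net_b)   # summary of the comparison

print(f"mean loss A = {loss_a:.4f}, mean loss B = {loss_b:.4f}, gap = {loss_gap:.4f}")
print(f"mean KL(A || B) = {divergence:.4f}")
```

In this toy case the two mean losses coincide, so the gap between them says the networks are indistinguishable, while the mean per-datapoint KL is clearly positive because the outputs disagree on every point. Both quantities are single numbers; the difference is only in whether the aggregation happens before or after the networks are compared.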