What are your thoughts on KL-div after the unembed softmax as a metric?
On its own, this’d be another metric that doesn’t track the right scale as models become more powerful.
The same KL-div in GPT-2 and GPT-4 probably corresponds to the destruction of far more internal structure in the latter than in the former.
Destroy 95% of GPT-2’s circuits, and the resulting output distribution may look quite different. Destroy 95% of GPT-4’s circuits, and the resulting output distribution may not be all that different, since the remaining 5% of GPT-4’s circuits might still be enough to get a lot of the most common token-prediction cases roughly right.
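(For concreteness, a minimal sketch of the metric in question: the KL divergence between the intact and ablated models’ next-token distributions, taken after the unembed softmax and averaged over positions. The names `logits_full` and `logits_ablated`, and the ablation itself, are illustrative assumptions rather than anything specified in this thread.)

```python
import torch.nn.functional as F

def kl_after_softmax(logits_full, logits_ablated):
    # Per-position KL(p_full || p_ablated) over the vocabulary,
    # computed after the unembed softmax, then averaged over positions.
    # Both inputs are pre-softmax logits of shape (..., vocab_size).
    logp_full = F.log_softmax(logits_full, dim=-1)
    logp_ablated = F.log_softmax(logits_ablated, dim=-1)
    kl = (logp_full.exp() * (logp_full - logp_ablated)).sum(dim=-1)
    return kl.mean()
```

Averaged over a corpus, this yields the single scalar whose scale, per the point above, won’t stay comparable across model sizes.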
I don’t see important differences between that and the CE loss delta in the context Lucius is describing.
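One way to see why the two come apart so little: both metrics are expectations of the same log-probability gap, log p_full(v) − log p_ablated(v). The KL above takes that expectation under the intact model’s own output distribution, while the CE loss delta takes it at the realized next token from the data. A sketch under the same (assumed) names as above:

```python
import torch.nn.functional as F

def ce_loss_delta(logits_full, logits_ablated, targets):
    # Increase in cross-entropy loss on the true next tokens after ablation:
    # E_data[log p_full(y) - log p_ablated(y)].
    # logits: (batch, seq, vocab); targets: (batch, seq) token ids.
    ce_full = F.cross_entropy(logits_full.flatten(0, -2), targets.flatten())
    ce_ablated = F.cross_entropy(logits_ablated.flatten(0, -2), targets.flatten())
    return ce_ablated - ce_full
```

On in-distribution text, where the intact model already puts substantial probability on the realized tokens, the two expectations tend to move together, which is the sense in which they measure the same thing here.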