Thanks Marius for this great write-up!

> However, I was surprised to find that the datapoints the network misclassified on the training data are evenly distributed across the D* spectrum. I would have expected them all to have low D*, i.e. the network didn't learn them.
My first intuition here was that the misclassified data points were ones where the network tried to use the learned features and just got it wrong, rather than points the network didn't bother to learn. Say, a 2 that looks a lot like an 8, so to the network it looks like a middle-of-the-spectrum 8? Not sure if this is sensible.
> The shape of D* changes very little between initialization and the final training run.
I think this is actually a big hint that a lot of what we see in those plots might not be what we think it is / an illusion. Any shape that is already present at initialization cannot tell us anything about the trained network. More on this later.
> the distribution of errors is actually left-heavy, which is exactly the opposite of what we would expect
Okay, this would be much easier to read if you collapsed the x-axis of those line plots and made it a histogram (the x-axis is just the sorted index, right?). Then you could turn the dots into histograms as well.
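To make that concrete, here is a minimal sketch of the histogram version. It is purely illustrative: `dstar` and `misclassified` are placeholder arrays standing in for the real per-datapoint D* values and error mask, and the third panel adds the per-bin misclassification rate.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder inputs: one D* value per training datapoint (the quantity on the
# y-axis of the sorted line plot) and a boolean mask of misclassified points.
rng = np.random.default_rng(0)
dstar = rng.lognormal(mean=-6, sigma=1, size=10_000)   # stand-in values
misclassified = rng.random(10_000) < 0.02              # stand-in error mask

# Shared log-spaced bins so the panels are directly comparable.
bins = np.logspace(np.log10(dstar.min()), np.log10(dstar.max()), 40)
counts_all, _ = np.histogram(dstar, bins=bins)
counts_err, _ = np.histogram(dstar[misclassified], bins=bins)
centers = np.sqrt(bins[:-1] * bins[1:])  # geometric bin centers

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, sharex=True, figsize=(6, 8))

ax1.bar(centers, counts_all, width=np.diff(bins))
ax1.set_ylabel("all datapoints")

ax2.bar(centers, counts_err, width=np.diff(bins))
ax2.set_ylabel("misclassified")

# Misclassification rate per bin (errors / total, skipping empty bins).
with np.errstate(divide="ignore", invalid="ignore"):
    rate = np.where(counts_all > 0, counts_err / counts_all, np.nan)
ax3.plot(centers, rate, marker="o")
ax3.set_ylabel("error rate per bin")
ax3.set_xlabel("D*")

for ax in (ax1, ax2, ax3):
    ax.set_xscale("log")

plt.tight_layout()
plt.show()
```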
> we would think that especially weird examples are more likely to be misclassified, i.e. examples on the right-hand side of the spectrum
So are we sure that weird examples are on the right-hand side? If I take weird examples to just trigger a random set of features, would I expect this to have a high or low dimensionality? Given that the normal case is 1e-3 to 1e-2, what’s the random chance value?
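One way to answer the chance-level question empirically rather than by intuition is to recompute the same statistic on activations that are pure noise. The sketch below assumes D* is something like a normalized participation ratio over per-datapoint feature activations; that is only my guess at the definition, so swap in the actual D* computation from the post. The point is just that "triggers a random set of features" gives a concrete baseline to compare against the 1e-3 to 1e-2 range, and that the answer depends a lot on how sparse the random activation pattern is.

```python
import numpy as np

def assumed_dstar(acts: np.ndarray) -> np.ndarray:
    """Per-datapoint participation ratio of feature activations, normalized so
    values lie in [1/n_features, 1]. NOTE: this is an assumed stand-in for D*,
    not the definition from the post."""
    num = acts.sum(axis=1) ** 2
    den = (acts ** 2).sum(axis=1) * acts.shape[1]
    return num / den

rng = np.random.default_rng(0)
n_points, n_features = 10_000, 1_000

# "A weird example triggers a random set of features": how sparse that random
# set is matters a lot, so compute the baseline for a few sparsity levels.
for frac_active in (1.0, 0.1, 0.01):
    acts = np.abs(rng.normal(size=(n_points, n_features)))
    acts *= rng.random((n_points, n_features)) < frac_active
    valid = acts.any(axis=1)  # guard against all-zero rows at high sparsity
    baseline = assumed_dstar(acts[valid]).mean()
    print(f"{frac_active:>5.0%} of features active -> baseline ≈ {baseline:.3g}")
```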
> We train models from scratch for 1, 2, 3, 8, 18, and 40 iterations and plot D*, the location of all misclassified datapoints, and a histogram of the misclassification rate per bin.
This seems to suggest the left-heavy distribution might actually be due to initialization too? The left-tail seems to decline a lot after a couple of training iterations.
I think one of the key checks for this metric will be ironing out which apparent effects are just initialization. Those nice line plots look suggestive, but if initialization already produces the same picture, we can't be sure what we can actually learn from them.
One idea to get traction here would be: Run the same experiment with different seeds, do the same plot of max data dim by index, then take the two sorted lists of indices and scatter-plot them. If this looks somewhat linear there might be some real reason why some data points require more dimensions. If it just looks random that would be evidence against inherently difficult/complicated data points that the network memorizes / ignores every time.
Edit: Some evidence for this is that the 1s tend to sit systematically at the right of the curve, so there does seem to be some effect inherent to the data!
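A minimal sketch of the seed-comparison check described above. The helper `max_data_dim` is hypothetical: it stands in for whatever trains a network from a given seed and returns the max data dimensionality of every training datapoint (here it returns synthetic values only so the sketch runs end to end).

```python
import numpy as np
import matplotlib.pyplot as plt

def max_data_dim(seed: int) -> np.ndarray:
    """Hypothetical helper: train a network from `seed` and return the max data
    dimensionality of every training datapoint (same datapoint ordering for all
    seeds). The synthetic values below are only there so the sketch runs."""
    rng = np.random.default_rng(seed)
    return rng.lognormal(mean=-6, sigma=1, size=10_000)

d_seed_a = max_data_dim(seed=0)
d_seed_b = max_data_dim(seed=1)

# Rank every datapoint by its dimensionality under each seed. If high-dimension
# datapoints are an inherent property of the data, the ranks should line up.
ranks_a = np.argsort(np.argsort(d_seed_a))
ranks_b = np.argsort(np.argsort(d_seed_b))

plt.scatter(ranks_a, ranks_b, s=2, alpha=0.3)
plt.xlabel("rank of datapoint's max data dim (seed 0)")
plt.ylabel("rank of datapoint's max data dim (seed 1)")
plt.show()

# One-number summary of the same scatter plot: Pearson correlation of the ranks
# is the Spearman rank correlation. ~0 means seed-dependent noise, close to 1
# means the ordering is a property of the data itself.
print("rank correlation:", np.corrcoef(ranks_a, ranks_b)[0, 1])
```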