I’m curious whether you have guesses about how many of the singular dimensions corresponded to dead neurons (or neurons that are “mostly dead,” activating for only a tiny fraction of the training set), versus how much the zero-gradient directions depended dynamically on the training example.