I didn’t impose any structure in the objective/loss function relating to the label; the loss is just the regular VAE loss. All I did was detach the gradients in some places, so it’s a bit surprising to me that such a simple modification can cause the internals to specialize this way. Having seen gradient routing work in other experiments, I predicted it would work here too, but I don’t think that was a priori obvious (in the sense that I’d have gotten zero new information from running the experiment because I had predicted the outcome with p=1).
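To make “detach the gradients in some places” concrete, here’s a minimal PyTorch sketch of the routing step (illustrative only; it assumes one latent dimension per digit class, which may not match the exact setup, and the names are mine):

```python
import torch
import torch.nn.functional as F

def route_gradients(z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Detach gradients on all latent dims except the one matching the label.

    The forward pass is unchanged; only the backward pass is masked.
    Sketch: assumes z has shape (batch, num_classes), one dim per digit.
    """
    mask = F.one_hot(labels, num_classes=z.shape[1]).float()
    # Gradients flow only through the label's dimension; the rest are detached.
    return mask * z + (1 - mask) * z.detach()

# The training step then uses the ordinary VAE objective; routing is the only
# change (pseudocode, hypothetical helper names):
#   z = encoder_reparameterize(x)
#   z = route_gradients(z, labels)
#   loss = reconstruction_loss(decoder(z), x) + kl_divergence(...)
```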
I agree, but the point I’m making is that you had to know the labels in order to know where to detach the gradient. So it’s kind of like making something interpretable by imposing your interpretation on it, which feels tautological.
For the record, I’m excited by gradient routing, and I don’t want to come across as a downer, but this application doesn’t compel me.
Edit: Here’s an intuition pump: would you be similarly excited by training 10 different autoencoders, each reconstructing a single digit, and then stitching them together into a single global autoencoder? Conceptually, that seems like what you’re doing.
I disagree that this is the same as just stitching together separate autoencoders. Presumably the encoder does some shared computation before specializing at the encoding level. I also don’t see how you could use 10 separate autoencoders to classify an image from its encoding. I suppose you could look at the reconstruction losses and take the autoencoder with the lowest loss as the predicted label (see the sketch below), but that seems different from what I’m doing.
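For concreteness, here is roughly what that reconstruction-loss baseline would look like (illustrative sketch; `autoencoders[k]` is assumed to have been trained to reconstruct only digit k):

```python
import torch

def classify_by_reconstruction(x: torch.Tensor, autoencoders) -> int:
    """Predict the digit whose dedicated autoencoder reconstructs x best.

    Baseline from the intuition pump above: run the image through all 10
    per-digit autoencoders and pick the one with the lowest MSE.
    """
    with torch.no_grad():
        errors = [torch.mean((ae(x) - x) ** 2).item() for ae in autoencoders]
    return errors.index(min(errors))
```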
However, I agree that this application is not useful. I shared it because I (and others) thought it was cool. It’s not really practical at all.
Hope this addresses your question :)
I see; if you disagree with that characterization, then I’ve likely misunderstood what you were doing in this post, in which case I no longer endorse the statements above. Thanks for clarifying!