I’d be interested in seeing experiments in which we start with the version of LCA where everything is negative and make only one of the changes at a time. This would let us narrow down which particular change causes a given effect, much like an ablation study.
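For concreteness, here's a minimal sketch of the per-step quantity such an ablation would be varying, assuming the basic first-order decomposition ΔL ≈ Σᵢ gradᵢ · Δθᵢ (the paper uses a more careful path-integral estimate of the gradient along the update, which I've omitted here). The names `lca_step`, `loss_fn`, etc. are my own, not from the paper:

```python
import torch

def lca_step(model, loss_fn, data, target, optimizer):
    """One optimizer step that also returns a per-parameter loss-change
    allocation, using the first-order approximation
    delta_L ~= sum_i grad_i * delta_theta_i."""
    params = [p for p in model.parameters() if p.requires_grad]
    theta_before = [p.detach().clone() for p in params]

    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    grads = [p.grad.detach().clone() for p in params]
    optimizer.step()

    # Allocate the loss change to each parameter:
    # grad_i * (theta_after_i - theta_before_i).
    # Negative entries are parameters that (to first order) helped reduce the loss.
    lca = [g * (p.detach() - t0) for g, p, t0 in zip(grads, params, theta_before)]
    return loss.item(), lca
```

An ablation would then swap in one modification at a time, starting from this baseline, and compare the resulting allocations.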
Fwiw, I think this is one of the easier papers to replicate in deep learning, and so would make a great starter project for someone trying to get into deep learning and/or AI safety. I also think the resulting analysis could be publishable at a top ML conference.
You might worry about whether this differentially advantages safety or capabilities. My view is that improved understanding of deep learning is positive for the world (see here); I also think that enough people who have thought about the problem agree with me that you shouldn’t worry about the unilateralist’s curse. That said, there are people who would argue for the opposite position.