To rephrase: I think the default mode for discoveries in interpretability should be “don’t publish” and publishing should happen after a careful weighing of upsides and downsides. Researchers need to train themselves to the unconditional mental motion “do I really want everybody to know about this?”
Yep, that’s a distinct claim from the one I was making. It’s not a crazy position to have as an ideal to strive for, but I’m not confident that it ends up being net-upside in the current regime absent other concurrent changes. Need to think about it more.
To rephrase: I think the default mode for discoveries in interpretability should be “don’t publish” and publishing should happen after a careful weighing of upsides and downsides. Researchers need to train themselves to the unconditional mental motion “do I really want everybody to know about this?”
Yep, that’s a distinct claim from the one I was making. It’s not a crazy position to have as an ideal to strive for, but I’m not confident that it ends up being net-upside in the current regime absent other concurrent changes. Need to think about it more.