If alignment is about getting models to do what you want and to avoid certain negative behaviors, then researching how to get models to censor certain outputs could, in theory, produce insights for alignment.
I was referred by 80k Hours to talk to a manager on the OpenAI safety team who argued exactly this to me. I didn't join, so I have no idea to what extent it actually makes sense versus just being a nice-sounding idea.