I appreciate you writing this! It really helped me get a more concrete sense of what it is like for new alignment researchers (like me) to be aiming to make good things happen.
Owain asked us not to publish this for fear of capabilities improvements
Note that “capabilities improvements” can mean a lot of things here. The first thing that comes to mind is that publicizing this differentially increases the damage API users could do with access to SOTA LLMs, which makes sense to me. It also makes sense to me that Owain would consider publishing this idea not worth the downside, simply because, off the top of my head, there’s not much benefit to publicizing it for either alignment researchers or capabilities researchers. OpenAI capabilities people have probably already tried such experiments internally and know of this, and alignment researchers probably wouldn’t be able to build on top of this finding (here I mostly have interpretability researchers in mind).
Sometimes I needed more information to work on a task, but I tended to assume this was my fault. If I were smarter, I would be able to do it, so I didn’t want to bother Joseph for more information. I now realise this is silly—whether the fault is mine or not, if I need more context to solve a problem, I need more context, and it helps nobody to delay asking about this too much.
Oh yeah, I have had this issue many (but not all) times with mentors in the past. I suggest not simply rationalizing that emotion away, though, and instead trying to actually debug it. “Whether the fault is mine or not”, sure, but if my brain tracks whether I am an asset or a liability to the project, then my brain is giving me important information in the form of my emotions.
Anyway, I’m glad you now have a job in the alignment field!