My sense is that Anthropic is somewhat oriented around this idea. I’m not sure if this is their actual plan or just some guesswork I read between the lines.
But I vaguely recall something like “develop capabilities that you don’t publish, while also developing interpretability techniques which you do publish, and try to have a competitive edge on capabilities which you then have some lead time to try to inspect via intepretability techniques and the practice alignment on various capability-scales.
(I may have just made this up while trying to steelman them to myself)
My sense is that Anthropic is somewhat oriented around this idea. I’m not sure if this is their actual plan or just some guesswork I read between the lines.
But I vaguely recall something like “develop capabilities that you don’t publish, while also developing interpretability techniques which you do publish, and try to have a competitive edge on capabilities which you then have some lead time to try to inspect via intepretability techniques and the practice alignment on various capability-scales.
(I may have just made this up while trying to steelman them to myself)