I think there are approximately zero people actively trying to take actions which, according to their own world model, are likely to lead to the destruction of the world. As such, I think it’s probably helpful on the margin to publish stuff of the form “model internals are surprisingly interpretable, and if you want to know if your language model is plotting to overthrow humanity there will probably be tells, here’s where you might want to look”. More generally “you can and should get better at figuring out what’s going on inside models, rather than treating them as black boxes” is probably a good norm to have.
I could see the argument against, for example if you think “LLMs are a dead end on the path to AGI, so the only impact of improvements to their robustness is increasing their usefulness at helping to design the recursively self-improving GOFAI that will ultimately end up taking over the world” or “there exists some group of alignment researchers that is on track to solve both capabilities and alignment such that they can take over the world and prevent anyone else from ending it” or even “people who think about alignment are likely to have unusually strong insights about capabilities, relative to people who think mostly about capabilities”.
That said, I’m not aware of any arguments that alignment researchers specifically should refrain from publishing that don’t rest on some pretty specific upstream assumptions like the ones above.