I agree with the main points made in the post, though I want to recognize that there is some difficulty in predicting which aspects will drive capability advances. I think there is value in reading papers (something that more alignment researchers should probably do) because it can give us hints at the next capability leaps. Over time, it can improve our intuition for what lies ahead and allow us to better predict the order of capability advances. This is how I’ve felt as I’ve been pursuing the Accelerating Alignment agenda (language model systems for accelerating alignment research). I’ve been at the forefront, reading Twitter/papers/etc. to find insights into how to use language models for research, and I feel like I’ve gained a lot of intuition about where the field is going.
As you said, it’s also important to remember that most of the field isn’t directly aiming for AGI. Safety discussions, particularly about self-improvement and similar topics, may have inspired some individuals to pursue directions useful for AGI when they might not have otherwise. This is why some people will say things like, “AI safety has been net negative and AGI safety discussions have shortened AGI timelines.” I think there is some truth to the timelines argument, but in my opinion it’s not clear the work has been net negative. At some point, AI safety work must be done and investment must be made in AGI safety.
One concern I’d like to bring up as a point of discussion is whether infohazard policies could backfire. By withholding certain insights, these policies may leave safety researchers in the dark about the field’s trajectory while capability researchers are engaged in active discussions. Some of us were aware that AgentGPT-like models would likely arrive soon (though unsure of the exact date), but they seem to have blindsided a lot of people concerned about alignment. It’s possible that safety researchers could be blindsided again by rapid developments they were not privy to because of infohazard policies.
This may have been manageable when progress was slower, but now, with global attention on AI, some infohazard policies may backfire, particularly because alignment people cannot react as quickly as they should. I think the balance mostly favours keeping infohazard policies as they are for now, but this was a thought I had earlier this week and figured I would share.