I want to clarify two things:
1. I do not think AI alignment research is hopeless; my personal probability of doom from AI is something like 25%. My frame is definitely not “death with dignity;” I’m thinking about how to win.
2. I think there’s a lot of empirical research that can be done to reduce risks, including interpretability, “sandwiching”-style projects, adversarial training, testing particular approaches to theoretical problems like eliciting latent knowledge, the evaluations you linked to, empirically demonstrating issues like deception (as you suggested), and more. Lots of groups are in fact working on such problems, and I’m happy about that.
The specific thing I said was hard to name was realistic, plausible experiments we could do on today’s models that would a) make me update strongly toward “racing forward with plain HFDT will not lead to an AI takeover,” and b) be accepted as “fair” by people who disagree with my claim. I gave an example right after that of a type of experiment I don’t expect ML people to consider “fair” as a test of this hypothesis. If I saw that ML people could consistently predict the direction in which gradient descent is suboptimal, I would update a lot against this risk.
I think there’s a lot more room for empirical progress if you assume this is a real risk and try to address it than for empirical progress that could realistically cause either skeptics or concerned people to update on whether there’s any risk at all. A forthcoming post by Holden gets into some of these things.
(By the way, I was delighted to find out you’re working on that ARC project that I had linked! Keep it up! You are like in the top 0.001% of world-helping-people.)
Thanks, but I’m not working on that project! That project is led by Beth Barnes.