First, I totally agree, and in general think that if the AI risk community can’t successfully adopt a norm of “don’t advance capabilities research” we basically can’t really collaborate on anything.
I’m saddened to hear you mention Anthropic. I was holding out hope after SBF invested and looking at their site that they were the only present AGI company that was serious about not advancing the public state of capabilities research. Was that naive? Where do they fall along the Operational Adequacy spectrum?
I’d also be interested in hearing which parts of Anthropic’s research output you think burn our serial time budget. If I understood the post correctly, then OP thinks that efforts like transformer circuits are mostly about accelerating parallelizable research.
Maybe OP thinks that:
mechanistic interpretability has little value in terms of serial research;
RLHF does not give us alignment (because it doesn’t generalize beyond the “sharp left turn” which OP thinks is likely to happen);
therefore, since most of Anthropic’s alignment-focused output has little value in terms of serial research, and it does somewhat enhance present-day LLM capabilities/usability, it is net negative?
But I’m very much unsure whether OP really believes this; I would love to hear him elaborate.
ETA: It could also be the case that OP was exclusively referring to the part of Anthropic that is about training LLMs efficiently as a prerequisite to studying those models?
My mental filter for inclusion on that list was apparent prevalence of the “we can’t do alignment until we have an AGI in front of us” meme. If a researcher has that meme and their host org is committed to not advancing the public capabilities frontier, that does ameliorate the damage, and Anthropic does seem to me to be doing the best on that front (hooray for Anthropic!). That said, my impression is that folks at Anthropic are making the tradeoffs differently from how I would, and my guess is that this is in part due to differences in our models of what’s needed for alignment, in a fashion related to the topic of the OP.
Thanks for elaborating! Insofar as your assessment is based on in-person interactions, I can’t really comment since I haven’t spoken much with people from Anthropic.
I think there are degrees to believing this meme you refer to, in the sense of “we need an AI of capability level X to learn meaningful things”. And I would guess that many people at Anthropic do believe this weaker version; it’s their stated purpose after all. And for some values of X this statement is clearly true, e.g. the filters learned by shallow CNNs trained on MNIST are not interpretable, whereas the filters of deep Inception-style CNNs trained on ImageNet are (mostly) interpretable.
One could argue that parts of interpretability do need to happen in a serial manner, e.g. finding out the best way to interpret transformers at all, the recent SoLU finding, or just generally building up knowledge on how to best formalize or go about this whole interpretability business. If that is true, and furthermore interpretability turns out to be an important component in promising alignment proposals, then the question is mostly about what level of X gives you the most information to advance the serial interpretability research relative to how much other serial budget you burn.
I don’t know whether people at Anthropic believe the above steps or have thought about it in these ways at all, but if they did, this could possibly explain the difference in policies between you and them?
I disagree with that view, primarily because I believe the sharp left turn is an old remnant of the hard-takeoff view, which was always physically problematic. Now that we actually have AI in a limited form, there does seem to be a discontinuity at first, but it will rarely get you all the way, and once that first jump is behind us, progress is much smoother and slower. So slow takeoff is my mainline scenario for AI.
Finally, I think we will ultimately have to experiment because, to be very blunt, humans are quite bad at reasoning from first principles or priors. Without feedback from reality, reasoning like a formal mathematician or from first principles tends to be wildly wrong for real life.