My mental filter for inclusion on that list was apparent prevalence of the “we can’t do alignment until we have an AGI in front of us” meme. If a researcher has that meme and their host org is committed to not advancing the public capabilities frontier, that does ameliorate the damage, and Anthropic does seem to me to be doing the best on that front (hooray for Anthropic!). That said, my impression is that folks at Anthropic are making the tradeoffs differently from how I would, and my guess is that this is in part due to differences in our models of what’s needed for alignment, in a fashion related to the topic of the OP.
Thanks for elaborating! Insofar as your assessment is based on in-person interactions, I can't really comment, since I haven't spoken much with people from Anthropic.
I think there are degrees to believing the meme you refer to, in the sense of "we need an AI of capability level X to learn meaningful things". And I would guess that many people at Anthropic do believe this weaker version; it's their stated purpose, after all. And for some values of X the statement is clearly true: e.g., the filters learned by shallow CNNs trained on MNIST are not interpretable, whereas the filters of deep Inception-style CNNs trained on ImageNet are (mostly) interpretable.
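As a concrete illustration of what "interpretable filters" means here, below is a minimal sketch (assuming PyTorch and torchvision are installed, and using torchvision's GoogLeNet as a stand-in for an Inception-style ImageNet model) that visualizes the first-layer convolution filters; for such models they tend to look like oriented edge and color detectors, whereas applying the same plotting to a shallow MNIST CNN's learned filters usually yields far less legible patterns.

```python
# Sketch: inspect first-layer filters of a pretrained Inception-style network.
# Assumes torchvision >= 0.13 for the weights API.
import matplotlib.pyplot as plt
from torchvision.models import googlenet, GoogLeNet_Weights

# Load GoogLeNet (Inception v1) pretrained on ImageNet.
model = googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1)
model.eval()

# First conv layer: 64 filters of shape 3x7x7.
filters = model.conv1.conv.weight.detach()  # (64, 3, 7, 7)

# Normalize each filter to [0, 1] for display.
fmin = filters.amin(dim=(1, 2, 3), keepdim=True)
fmax = filters.amax(dim=(1, 2, 3), keepdim=True)
filters = (filters - fmin) / (fmax - fmin + 1e-8)

# Plot the 64 filters in an 8x8 grid.
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.permute(1, 2, 0).numpy())  # channels-last for imshow
    ax.axis("off")
fig.suptitle("First-layer filters of GoogLeNet (ImageNet)")
plt.show()
```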
One could argue that parts of interpretability do need to happen in a serial manner: e.g., finding out how best to interpret transformers at all, the recent SoLU finding, or just generally building up knowledge on how to formalize or go about this whole interpretability business. If that is true, and if interpretability furthermore turns out to be an important component of promising alignment proposals, then the question is mostly about which level of X gives you the most information for advancing serial interpretability research per unit of other serial budget you burn.
I don't know whether people at Anthropic hold the views above, or have thought about it in these terms at all, but if they do, that could explain the difference in policies between you and them?
I disagree with that view, primarily because I believe the sharp left turn is a remnant of the old hard-takeoff view, which was always physically problematic. Now that we actually have AI in a limited form, while there does seem to be a discontinuity at first, it rarely gets you all the way, and once that first instance is behind us, progress is much smoother and slower. So slow takeoff is my mainline scenario for AI.
Finally, I think we will ultimately have to experiment, because, to be very blunt, humans are quite bad at reasoning from first principles or priors, and without feedback from reality, reasoning like a formal mathematician or from first principles tends to be wildly wrong about real life.