Thanks for elaborating! Insofar as your assessment is based on in-person interactions, I can't really comment, since I haven't spoken much with people from Anthropic.
I think there are degrees to believing this meme you refer to, in the sense of “we need an AI of capability level X to learn meaningful things”. And I would guess that many people at Anthropic do believe this weaker version; it's their stated purpose, after all. For some values of X the statement is clearly true: e.g. the filters learned by shallow CNNs trained on MNIST are not interpretable, whereas the filters of deep Inception-style CNNs trained on ImageNet are (mostly) interpretable.
One could argue that parts of interpretability research do need to happen serially, e.g. figuring out how best to interpret transformers at all, the recent SoLU finding, or more generally building up knowledge of how to formalize and go about this whole interpretability business. If that is true, and if interpretability furthermore turns out to be an important component of promising alignment proposals, then the question is mostly about which level of X gives you the most progress on serial interpretability research per unit of other serial budget you burn.
I don't know whether people at Anthropic believe the above steps, or have thought about it in these terms at all, but if they did, that could possibly explain the difference in policies between you and them?