Asya's summary for the Alignment Newsletter:
In this post, Daniel Filan presents an analytic perspective on how to do useful AI alignment research. He argues that in a world with powerful AGI systems similar to neural networks, it may be sufficient to detect whether a system would cause bad outcomes before deploying it in real-world settings with unknown distributions. To this end, he advocates for transparency work that gives <@mechanistic understandings@>(@Mechanistic Transparency for Machine Learning@) of the systems in question, combined with foundational research that allows us to reason about the safety of the resulting understandings.
My opinion:
My broad take: I agree that analyzing neural nets is useful and that more work should go into it, but I disagree that this reduces x-risk by letting developers look at their trained model, mechanistically understand it, determine whether it is dangerous, and decide whether to deploy it, all in a "zero-shot" way. The key difficulty is the mechanistic transparency itself, which seems like far too strong a property for us to aim for given empirical facts about the world (though perhaps normatively it would be better if humanity didn't deploy powerful AI until it was mechanistically understood).
The main intuition here is that the difficulty of mechanistic transparency grows at least linearly, and probably superlinearly, with model complexity. Already for image classifiers, members of OpenAI's Clarity team have spent multiple years understanding a single model, which is orders of magnitude more expensive than training that classifier was. Combining these two facts, it seems quite unlikely that we could achieve mechanistic transparency for the even more complex AGI systems built out of neural nets. More details in this comment. Note that Daniel agrees that it is an open question whether this sort of mechanistic transparency is possible, and thinks that we don't yet have much evidence that it isn't.