Planned summary for the Alignment Newsletter:

This post lays out the possible cases for highly reliable agent design (HRAD) work being the main priority of AI alignment. HRAD is a category of work at MIRI that aims to build a theory of intelligence and agency that can explain things like logical uncertainty and counterfactual reasoning.
The first case for HRAD work is that by becoming less confused about these phenomena, we will be able to help AGI builders predict, explain, avoid, detect, and fix safety issues, as well as conceptually clarify the AI alignment problem. For this purpose, we only need _conceptual_ deconfusion; precise equations defining what an AI system does are not required.
The second case is that if we get a precise, mathematical theory, we can use it to build an agent that we understand “from the ground up”, rather than throwing the black box of deep learning at the problem.
The last case is that understanding how intelligence works will give us a theory that allows us to predict how _arbitrary_ agents will behave, which would be useful for AI alignment in all the ways described in the first case and <@more@>(@Theory of Ideal Agents, or of Existing Agents?@).
Looking through past discussion on the topic, the author believes that people at MIRI primarily hold the first two cases. Meanwhile, critics (particularly me) argue that it seems quite unlikely that we can build a precise, mathematical theory, and that a more conceptual but imprecise theory may help us understand reasoning better but is unlikely to generalize well enough to say important, non-trivial things about AI alignment for the systems we are actually building.
Planned opinion:
I like this post—it seems like an accessible summary of the state of the debate so far. My opinions are already in the post, so I don’t have much to add.